Preface



Biometric authentication is gaining popularity in a broad spectrum
of applications, ranging from government programs (e.g., national ID cards, visas
for international travel, and the fight against terrorism) to personal applications
such as logical and physical access control. Although a number of effective
solutions are currently available, new approaches and techniques are necessary to
overcome some of the limitations of current systems and to open up new frontiers
in biometric research and development. The 30 papers presented at the Biometric
Authentication Workshop 2004 (BioAW 2004) provide a snapshot of current
research in biometrics and identify some new trends. This volume is composed
of five sections: face recognition, fingerprint recognition, template protection and
security, other biometrics, and fusion and multimodal biometrics. For classical
biometrics like fingerprint and face recognition, most of the papers in Sects. 1
and 2 address robustness issues in order to make the biometric systems work
in suboptimal conditions: examples include face detection and recognition
under uncontrolled lighting and pose variations, and fingerprint matching in the
case of severe skin distortion. Benchmarking and interoperability of sensors and
liveness detection are also topics of primary interest for fingerprint-based
systems. Biometrics alone is not the solution for complex security problems. Some
of the papers in Sect. 3 focus on designing secure systems; this requires dealing
with safe template storage, checking data integrity, and implementing solutions
in a privacy-preserving fashion. The match-on-tokens approach, provided that
current accuracy and cost limitations can be satisfactorily solved by using new
algorithms and hardware, is certainly a promising alternative. The use of new
biometric indicators like eye movement, 3D finger shape, and soft traits (e.g.,
height, weight and age) is investigated by some of the contributions in Sect. 4
with the aim of providing alternative choices for specific environments and
applications. Improvements and new ideas are also presented for other popular
biometrics like iris, palmprint and signature recognition. Multimodal
biometrics has been identified as a promising area; the papers in Sect. 5 explore
this topic and provide novel approaches for combinations at the
sensor, feature extraction and matching score levels.
Abstract. ICA (Independent Component Analysis) is contrasted with PCA
(Principal Component Analysis) in that ICA basis images are spatially
localized, highlighting salient feature regions corresponding to the eyes, eyebrows,
nose and lips. However, ICA basis images do not display perfectly local
characteristics, in the sense that pixels that do not belong to locally salient feature
regions still have some weight values. These pixels in the non-salient regions
degrade the recognition performance. We propose
a novel method based on ICA that employs only locally salient information. The
new method effectively implements the idea of “recognition by parts” for the
problem of face recognition. Experimental results using the AT&T, Harvard,
FERET and AR databases show that the proposed method outperforms the
PCA and ICA methods, especially at low dimensions and for facial images
that have partial occlusions and local distortions such as
changes in facial expression.



1 Introduction 

Over the last ten years, canonical subspace projection techniques such as PCA and
ICA have been widely used in face recognition research [2-4]. These techniques employ
feature vectors consisting of coefficients that are obtained by projecting facial images
onto their basis images. The basis images are computed offline from a set of training
images. ICA is contrasted with PCA in that ICA basis images are more spatially local
than PCA basis images. Fig. 1 (a) and (b) show facial image representations using PCA
and ICA basis images, respectively, that are computed from a set of images randomly
selected from the AR database. PCA basis images display global properties in the
sense that they assign significant weights to the same pixels. This accords with the fact
that PCA basis images are just scaled versions of global Fourier filters [21]. In
contrast, ICA basis images are spatially more localized, highlighting salient feature
regions corresponding to the eyes, eyebrows, nose and lips. This local property of ICA
basis images makes ICA based recognition methods more robust than
PCA methods to partial occlusions and local distortions such as
changes in facial expression. Thus, ICA techniques have been widely applied to the
problem of face recognition [3-6], especially for face recognition under variations of
illumination, pose and facial expression. However, ICA basis images do not display
perfectly local characteristics, in the sense that pixels that do not belong to locally
salient feature regions still have some weight values. These pixels in the non-salient
regions degrade the recognition performance.

We propose a novel method based on ICA, named LS-ICA (locally salient ICA),
in which the concept of “recognition by parts” [18-20] can be effectively realized for
face recognition. The idea of “recognition by parts” has been a popular paradigm in
object recognition research that can be successfully applied to the problem of object
recognition with occlusion. Our method is characterized by two ideas. The first is the removal
of non-salient regions in ICA basis images so that LS-ICA basis images only employ
locally salient feature regions. The second is the use of ICA basis images in the order of
class separability so as to maximize the recognition performance. Experimental
results show that LS-ICA performs better than PCA and ICA, especially in the cases of
partial occlusions and local distortions such as changes in facial expression. In
addition, the performance improvement of LS-ICA over ICA based methods
becomes much greater as the dimensionality (i.e., the number of basis images
used) decreases.



The rest of this paper is organized as follows. Section 2 contrasts ICA with PCA in 
terms of locality of features. Section 3 describes the proposed LS-ICA method. 
Section 4 presents experimental results. 




(a) PCA representation = $(e_1, e_2, e_3, e_4, e_5, \ldots, e_n)$

(c) LS-ICA representation = $(b'_1, b'_2, b'_3, b'_4, b'_5, \ldots, b'_n)$



Figure 1. Facial image representations using (a) PCA, (b) ICA and (c) LS-ICA basis images: A
face is represented as a linear combination of basis images. The basis images were computed 
from a set of images randomly selected from the AR database. In the basis images of LS-ICA, 
non-salient regions of ICA basis images are removed. Using LS-ICA basis images, the concept 
of “recognition by parts” can be effectively implemented for face recognition. 



2 ICA Versus PCA 

PCA and ICA are the most widely used subspace projection techniques that project
data from a high-dimensional space to a lower-dimensional space [2, 4]. PCA
addresses only second-order moments of the input. It is optimal for finding a reduced
representation that minimizes the reconstruction error, but it is not optimal for
classification. ICA is a generalization of PCA that decorrelates the high-order
statistics in addition to the second-order moments. Much of the information about the
characteristic local structure of facial images is contained in the higher-order statistics
of the images. Thus ICA, where the high-order statistics are decorrelated, may
provide a more powerful representational basis for face recognition than PCA, where
only the second-order statistics are decorrelated. Figure 2 illustrates PCA and ICA axes
for the same 2D distribution. PCA finds an orthogonal set of axes pointing in the
directions of maximum covariance in the data, while ICA attempts to place axes
pointing in the directions of spatially localized and statistically independent basis
vectors [17].

As previously described, global properties of faces may be more easily captured by
PCA than by ICA. As shown in Figure 1, ICA basis images are more spatially localized
and never overlap, unlike their PCA counterparts [1]. Since spatially localized features
only influence small parts of facial images, ICA based recognition methods are less
susceptible to occlusions and local distortions than are global feature based methods
such as PCA. We can compute ICA basis images using various algorithms such as
InfoMax [3, 10], FastICA [5, 8] and maximum likelihood [6, 7].





Figure 2. PCA and ICA axes for an identical 2D data distribution [17] 



3 The LS-ICA (Locally Salient ICA) Method 

The LS-ICA method features the use of new basis images made from ICA basis
images that are selected in decreasing order of class separability. Only salient
feature regions are contained in the LS-ICA basis images. As in most algorithms that
employ subspace projection, the LS-ICA method computes a projection matrix
off-line from a set of training images. Let $W_{ls\_ica}$ denote the projection matrix. The
columns of $W_{ls\_ica}$ are LS-ICA basis images. During recognition, an input face
image $x$ is projected to $\Omega' = W_{ls\_ica}^{T} x$ and classified by comparison with the vectors
$\Omega_r$ that were computed off-line from the set of training images.

[Figure 3: block diagram. Order ICA basis vectors in terms of class separability; take
the desired number of ICA basis vectors and convert them into LS-ICA basis vectors;
compute the projections of the training images; apply the same linear transformation
to the input image; find the nearest neighbor of the input projection among the
training projections; output the recognition result.]

Figure 3. Algorithm overview

Figure 3 shows a block diagram of the method. First, we preprocess training images
by applying histogram equalization and scale normalization, where the size of images
is adjusted so that they have the same distance between the two eyes. Second, we
compute ICA basis images using the FastICA algorithm [5, 8]. FastICA first whitens
the data so that the components are uncorrelated and then maximizes the
non-Gaussianity of the data distribution through kurtosis
maximization [5]. We then compute a measure of class separability, $r$, for each ICA
basis vector and sort the ICA basis vectors in decreasing order of class separability
[3]. To compute $r$ for each ICA basis vector, the between-class variability $\sigma_{between}$
and within-class variability $\sigma_{within}$ of its corresponding projection coefficients of
training images are obtained as follows.

$\sigma_{between} = \sum_i (M_i - M)^2$ (1)

$\sigma_{within} = \sum_i \sum_j (b_{ij} - M_i)^2$ (2)

$M$ and $M_i$ are the total mean and the mean of class $i$, and $b_{ij}$ is the coefficient of
the $j$th training image in class $i$. The class separability, $r$, is then defined as the
ratio

$r = \sigma_{between} / \sigma_{within}$ (3)
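
The ordering step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the between-class and within-class variabilities are plain sums of squared deviations, and the function names and the nested-list data layout are hypothetical.

```python
def class_separability(coeffs_by_class):
    """Return r = sigma_between / sigma_within for one basis vector.

    coeffs_by_class[i][j] is the projection coefficient of the j-th
    training image of class i on this basis vector (assumed layout).
    """
    all_coeffs = [b for cls in coeffs_by_class for b in cls]
    total_mean = sum(all_coeffs) / len(all_coeffs)
    class_means = [sum(cls) / len(cls) for cls in coeffs_by_class]
    # Between-class variability: spread of the class means around the total mean.
    sigma_between = sum((m - total_mean) ** 2 for m in class_means)
    # Within-class variability: spread of the coefficients around their class mean.
    sigma_within = sum((b - m) ** 2
                       for cls, m in zip(coeffs_by_class, class_means)
                       for b in cls)
    return sigma_between / sigma_within


def order_by_separability(coeffs_per_basis):
    """Indices of the basis vectors, sorted by decreasing class separability."""
    scores = [class_separability(c) for c in coeffs_per_basis]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

A basis vector whose coefficients cluster tightly within each person and differ across persons receives a large r and is used first. The sketch does not guard against a zero within-class variability.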
Third, we create LS-ICA basis images from the ICA basis images selected in
decreasing order of class separability. This way, we can achieve both dimensionality
reduction and good recognition performance. To create an LS-ICA basis image, we
apply a series of operations to its corresponding ICA basis image as shown in Figure 
4. In order to detect locally salient regions, we simply find extreme values by 
thresholding a histogram of pixel values (Figure 4 (b)), followed by the application of 
morphological operations to find a blob region (Figure 4 (d)). As a result, we get an 
LS-ICA basis image (Figure 4 (e)) where only pixels in the blob regions have grey 
values copied from the corresponding pixels in the original ICA image. The values of 
the rest of the pixels in the image are set to zero. These LS-ICA basis images are used 
to represent facial images as shown in Figure 1 (c). 
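
The creation of one LS-ICA basis image can be sketched as below. The text does not specify the threshold or the morphological operations, so this sketch makes assumptions: "extreme values" are taken to be the pixels with the largest magnitudes, the fraction kept (`keep_fraction`) is a hypothetical parameter, and a single 3x3 closing stands in for the morphological step.

```python
def make_ls_ica_basis(image, keep_fraction=0.2):
    """Zero out the non-salient pixels of one ICA basis image (list of rows)."""
    h, w = len(image), len(image[0])

    # Step 1: threshold. Keep the pixels with the largest magnitudes
    # (assumed reading of "extreme values" of the pixel histogram).
    mags = sorted((abs(v) for row in image for v in row), reverse=True)
    n_keep = max(1, int(keep_fraction * h * w))
    thresh = mags[n_keep - 1]
    mask = [[abs(v) >= thresh for v in row] for row in image]

    def window(m, y, x):
        # 3x3 neighbourhood, clipped at the image borders.
        return [m[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))]

    # Step 2: morphological closing (dilation, then erosion) so that
    # nearby extreme pixels merge into blob regions.
    dilated = [[any(window(mask, y, x)) for x in range(w)] for y in range(h)]
    closed = [[all(window(dilated, y, x)) for x in range(w)] for y in range(h)]

    # Step 3: copy the original values inside the blobs, zero elsewhere.
    return [[image[y][x] if closed[y][x] else 0.0 for x in range(w)]
            for y in range(h)]
```

Applied to every selected ICA basis image, this yields basis images whose support is limited to the blob regions.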




Figure 4. Illustration of creating an LS-ICA basis image 



4 Experimental Results 

We have used several facial image databases such as AT&T [13], Harvard [14], 
FERET [15] and AR [16] databases in order to compare the recognition performance 
of LS-ICA with that of PCA and ICA methods. For fair comparisons with PCA and 
ICA based methods, PCA and ICA basis images were also used in the decreasing 
order of class separability, r. In the case of the ICA method, the recognition
performance was higher when the basis images were ordered in terms of class
separability. However, the PCA method did not show any noticeable performance
difference between the ordering by class separability and the original ordering in
terms of eigenvalues.

Table 1 lists the number of training and testing images used in each facial image
database for the experiment. Figure 5 shows example images from these databases.
In the AT&T database, all the images are taken against a dark homogeneous
background and the subjects are in an upright, frontal position with tolerance for
some side movement. In the Harvard database, a subject held his/her head steady while
being illuminated by a dominant light source. In the FERET database, we have used a
subset of the images of subjects under significantly different lighting and facial
expression. The AR database contains local distortions and occlusions such as
changes in facial expression and sunglasses worn.

All images were converted to 256 gray-level images and background regions were
removed. We have also applied histogram equalization to both training and testing
images in order to minimize variations of illumination. We have experimented using
thirty different sets of training and testing images for each database. We have
computed recognition performances for three different distance measures (L1, L2,
cosine) since we are concerned with performance variations independent of the
distance measure used [1].
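
The projection and nearest-neighbor classification under the three distance measures can be sketched as follows. This is an illustrative sketch only: images are flattened lists of pixel values, the helper names are hypothetical, and the cosine measure is written as one minus the cosine similarity, which is an assumption about the exact form used.

```python
import math

def project(basis, x):
    """Coefficients of image x on each basis image (i.e. W^T x)."""
    return [sum(w * v for w, v in zip(col, x)) for col in basis]

def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm

def nearest_neighbor(query, gallery, labels, dist):
    """Label of the training vector closest to the query under `dist`."""
    best = min(range(len(gallery)), key=lambda r: dist(query, gallery[r]))
    return labels[best]
```

The training projections are computed once off-line; only the projection of the probe image and the distance comparisons are needed at recognition time.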



Table 1. Facial databases used in the experiment

Database   Number of      Number of   Number of         Number of
           total images   persons     training images   testing images
AT&T       400            40          200               200
Harvard    165            5           82 (83)           83 (82)
FERET      605            127         127               478
AR         800            100         200               600



Figure 5. Example images from AT&T (top left), Harvard (top right), FERET (bottom left) and
AR (bottom right) facial databases



Figure 6 shows the recognition performances of PCA, ICA and LS-ICA methods for
the four facial databases. The recognition rate of the LS-ICA method was consistently
better than that of PCA and ICA methods regardless of the distance measure used. ICA
also consistently outperformed PCA except in the case where the L1 measure was used
for the FERET database. More interestingly, the LS-ICA method performed
better than the other methods especially at low dimensions. This property is very
important when we need to store facial feature data in low-capacity storage devices
such as smart cards and barcodes. To show this clearly, Figure 7 displays the
performance improvement of the LS-ICA method over the ICA method. The
performance improvement was the greatest in the case of the AR database, as we
expected. The AR database contains local distortions and occlusions such as
sunglasses worn. The LS-ICA method, which only makes use of locally salient
information, can achieve a higher recognition rate than ordinary ICA methods, which are
influenced by pixels not belonging to salient regions. The experimental results show
that, especially at low dimensions, LS-ICA basis images better represent facial images
than ICA basis images.




Figure 6. The recognition performance of PCA, ICA and LS-ICA methods for the four facial 
databases. The recognition rates represent the average performance of the experiment using 
thirty different sets of training and testing images. 



Figure 7. The performance improvement of the LS-ICA method over the ICA method for the four
facial databases. The performance improvement was the greatest in the case of the AR database,
which contains local distortions and occlusions such as sunglasses worn.

5 Conclusion and Future Research Directions

We have proposed the LS-ICA method that only employs locally salient information
in order to maximize the benefit of applying the idea of “recognition by parts” to the
problem of face recognition under partial occlusion and local distortion. The
performance of the LS-ICA method was consistently better than the ICA method
regardless of the distance measures used. As expected, the effect was the greatest in
the cases of facial images that have partial occlusions and local distortions such as
changes in facial expression. However, we expect that a combination of the proposed
LS-ICA method with a global feature based method such as PCA will yield better
recognition rates since face recognition is a holistic process [19]. Further research
efforts will be made to develop an optimal method that best combines the two methods.
Abstract. We propose a new method for face recognition under arbitrary
pose and illumination conditions, which requires only one training
image per subject. Furthermore, no limitation on the pose and illumination
conditions for the training image is necessary. Our method combines
the strengths of morphable models to capture the variability of 3D face
shape and a spherical harmonic representation for the illumination.
Morphable models are successful in 3D face reconstruction from a single
image. Recent research demonstrates that the set of images of a convex
Lambertian object obtained under a wide variety of lighting conditions
can be approximated accurately by a low-dimensional linear subspace
using a spherical harmonics representation. In this paper, we show that we
can recover the 3D faces with texture information from a single training
image under arbitrary illumination conditions and perform robust
pose and illumination invariant face recognition by using the recovered
3D faces. During training, given an image under arbitrary illumination,
we first compute the shape parameters from a shape error estimated
by the displacements of a set of feature points. Then we estimate the
illumination coefficients and texture information using the spherical
harmonics illumination representation. The reconstructed 3D models serve
as generative models to render sets of basis images of each subject for
different poses. During testing, we recognize the face for which there
exists a weighted combination of basis images that is the closest to the
test face image. We provide a series of experiments on approximately
5000 images from the CMU-PIE database. We achieve high recognition
rates for images under a wide range of illumination conditions, including
multiple sources of illumination.



1 Introduction 

Face recognition has recently received extensive attention as one of the most
significant applications of image understanding. Although rapid progress has been
made in this area during the last few years [29] [21] [16] [3] [35] [5] [19] [18] [8] [28] [24]
[9] [20] [31] [33], the general task of recognition remains unsolved. In general, face
appearance does not depend solely on identity. It is also influenced by
illumination and viewpoint. Changes in pose and illumination will cause large changes
in the appearance of a face. In this paper we demonstrate a new method to
recognize face images under a wide range of pose and illumination conditions using
spherical harmonic images of the face and a morphable model. Our method
requires only a single training image per subject. To our knowledge no other
face recognition method can achieve such a high level of pose and illumination
invariance when only one training image is available.

In the past few years, there have been attempts to address image variation
produced by changes in illumination and pose [10] [35]. Georghiades et al. [11]
present a method based on the illumination cone, which requires at least three
images per subject to build the cone. Romdhani et al. [23] recover
the shape and texture parameters of a 3D Morphable Model in an
analysis-by-synthesis fashion. In [23], the shape parameters are computed from a shape error
estimated by optical flow and the texture parameters are obtained from a
texture error. The algorithm uses linear equations to recover the shape and texture
parameters irrespective of pose and lighting conditions of the face image.
However, this method is restricted to images taken under single directional illumination
and requires knowledge of the light direction, which is difficult to obtain in most
cases.

In general, appearance-based methods like Eigenfaces [29] and SLAM [21]
need a number of training images for each subject in order to cope with pose and
illumination variability. Previous research suggests that illumination variability
in face images is low-dimensional, e.g., [2] [22] [12] [4] [1] [25] [26] [17]. Using spherical
harmonics and signal-processing techniques, Basri et al. [2] and Ramamoorthi
[22] have independently shown that the set of images of a convex Lambertian
object obtained under a wide variety of lighting conditions can be approximated
accurately by a 9 dimensional linear subspace. Furthermore, a simple scheme for
face recognition with excellent results was described in [2]. However, to use this
recognition scheme, the basis images spanning the illumination space for each
face are required. These images can be rendered from a 3D scan of the face or can
be estimated by applying PCA to a number of images of the same subject under
different illuminations [22]. An effective approximation of this basis by 9 single
light source images of a face was reported in [15], and Wang et al. [30] proposed
an illumination modeling and normalization method for face recognition. The
above mentioned methods need a number of training images and/or 3D scans
of the subjects in the database, requiring specialized equipment and procedures
for the capture of the training set, thus limiting their applicability. A promising
earlier attempt by [36] used symmetric shape from shading but suffered from the
drawbacks of SFS. A new approach is proposed in [34] for face recognition under
arbitrary illumination conditions, for fixed pose, which requires only one training
image per subject and no 3D shape information. In [34] the statistical model is
based on a collection of 2D basis images, rendered from known 3D shapes. Thus
3D shape is only implicitly included in the statistical model. Here we will base
our statistical model directly on 3D shapes, perform statistical analysis in 3D in
order to estimate the most appropriate 3D shape and then create the 2D basis
images. The ability to manipulate the 3D shape explicitly allows the generation
of basis images for poses that do not exist in the training data.

In this paper we propose a method that combines a 3D morphable model
and a low-dimensional illumination representation that uses spherical
harmonics. Our method requires only one training image for each subject without pose
and illumination limitations. Our method consists of three steps: 3D face
reconstruction, basis image rendering and recognition. Initially, similar to [23], given
a training image, we compute the shape parameters of a morphable model from
a shape error estimated by the displacements of a set of feature points. Then we
estimate the illumination coefficients and texture information using the
spherical harmonics illumination representation. In the basis image rendering step, the
reconstructed face models then serve as generative models that can be used to
synthesize sets of basis images under novel poses and spanning the illumination
field. During the recognition step, we use the recognition scheme proposed by
Basri et al. [2]. We return the face from the training set for which there exists a
weighted combination of basis images that is the closest to the test face image.
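
The recognition rule in the last sentence amounts to measuring, for each subject, how far the test image lies from the linear span of that subject's basis images. The sketch below illustrates this with an unconstrained least-squares fit computed by Gram-Schmidt orthogonalization; it is not the authors' implementation, the function names and the dictionary layout are hypothetical, and selecting the basis set for the pose of the test image is left out.

```python
import math

def residual_to_span(basis_images, image):
    """Distance from `image` to the linear span of `basis_images`.

    All images are flattened lists of equal length.
    """
    ortho = []
    for b in basis_images:
        v = list(b)
        for q in ortho:
            c = sum(x * y for x, y in zip(v, q))
            v = [x - c * y for x, y in zip(v, q)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:  # skip linearly dependent basis images
            ortho.append([x / norm for x in v])
    # Subtract the projection onto the span; what remains is the fitting error.
    r = list(image)
    for q in ortho:
        c = sum(x * y for x, y in zip(r, q))
        r = [x - c * y for x, y in zip(r, q)]
    return math.sqrt(sum(x * x for x in r))

def recognize(test_image, basis_by_subject):
    """Subject whose basis images best reproduce the test image."""
    return min(basis_by_subject,
               key=lambda s: residual_to_span(basis_by_subject[s], test_image))
```

With nine basis images per subject and pose, each comparison is a small linear fit, so the cost of recognition grows linearly with the number of enrolled subjects.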

We use the morphable model computed from the USF 3D face database [6]
and the CMU-PIE database [27] for training and testing. We provide a series
of experiments that show that the method achieves high recognition accuracy
although it requires only a single training image without limitation on
pose and illumination conditions. We compare the recognition rate with [23] on
the images taken under a single light source. We also give experimental results of
recognition on the set of images under multiple light sources, and compare with
[34] for known pose.

This paper is organized as follows. In the next section, we will briefly
introduce the Morphable Model. In Section 3, we explain the Spherical Harmonics
and how to acquire basis images from 3D face models. In Section 4, we describe
the process of 3D face model reconstruction and basis image rendering. In
Section 5, we describe the recognition process that uses the rendered basis images.
In Section 6, we describe our experiments and their results. The final Section
presents the conclusions and future work directions.



2 Morphable Model 

In this section we briefly summarize the morphable model framework described
in detail in [6] [7]. The 3D Morphable Face Model is a 3D model of faces with
separate shape and texture models that are learnt from a set of exemplar faces.
Morphing between faces requires complete sets of correspondences between all
of the faces. When building a 3D morphable model, we transform the shape and
texture spaces into vector spaces, so that any convex combination of exemplar
shapes and textures represents a realistic human face. We represent the geometry
of a face with a shape-vector $S = (X_1, Y_1, Z_1, X_2, \ldots, Y_n, Z_n)^T \in \mathbb{R}^{3n}$, which
contains the X, Y, Z-coordinates of its n vertices. Similarly, the texture of a face
can be represented by a texture-vector $T = (R_1, G_1, B_1, R_2, \ldots, G_n, B_n)^T \in \mathbb{R}^{3n}$,
where the R, G, B texture values are sampled at the same n points. A morphable
model can be constructed using a data set of m exemplar faces; exemplar i is
represented by the shape-vector $S_i$ and texture-vector $T_i$. New shapes s and
textures t can be generated by convex combinations of the shapes and textures of
the m exemplar faces: $s = \sum_{i=1}^{m} a_i S_i$, $t = \sum_{i=1}^{m} b_i T_i$, with $\sum_{i=1}^{m} a_i = \sum_{i=1}^{m} b_i = 1$. To
reduce the dimensionality of the shape and texture spaces, Principal Component
Analysis (PCA) is applied separately on the shape and texture spaces:

$s = \bar{s} + \sum_{i=1}^{m-1} \alpha_i \sigma_{s,i} s_i, \quad t = \bar{t} + \sum_{i=1}^{m-1} \beta_i \sigma_{t,i} t_i$ (1)

By setting the smallest eigenvalues to zero, Eq. 1 is reformulated as:

$s = \bar{s} + S\alpha, \quad t = \bar{t} + T\beta$ (2)

In Eq. 2 the columns of S and T are the most significant eigenvectors $s_i$ and
$t_i$ re-scaled by their standard deviation, and the coefficients $\alpha$ and $\beta$ constitute
a pose and illumination invariant low-dimensional coding of a face [23]. PCA
also provides an estimate of the probability densities of the shapes and textures,
under a Gaussian assumption: $p(s) \sim e^{-\frac{1}{2}\|\alpha\|^2}$, $p(t) \sim e^{-\frac{1}{2}\|\beta\|^2}$.
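
A minimal sketch of how a shape (or texture) is synthesized from the model in Eq. 2 follows. It assumes that the components are already re-scaled by their standard deviations, so that the Gaussian prior takes the usual unnormalized form exp(-½‖α‖²); the function names and the list-based layout are hypothetical.

```python
def synthesize(mean, components, coeffs):
    """Return mean + sum_k coeffs[k] * components[k].

    mean and each component are flattened vectors of length 3n.
    """
    out = list(mean)
    for c, comp in zip(coeffs, components):
        out = [o + c * v for o, v in zip(out, comp)]
    return out

def log_prior(coeffs):
    """Unnormalized Gaussian log-density of the coefficients, -0.5 * ||alpha||^2."""
    return -0.5 * sum(c * c for c in coeffs)
```

Setting all coefficients to zero returns the mean face, which is also the most probable face under this prior.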

3 Spherical Harmonics 

In this section, we briefly explain the illumination representation using
spherical harmonics and how we render basis images from 3D models using the
results of [2]. Let L denote the distant lighting distribution. By neglecting
cast shadows and near-field illumination, the irradiance E is then a function of
the surface normal n only and is given by an integral over the upper hemisphere
$\Omega(n)$ [22]: $E(n) = \int_{\Omega(n)} L(\omega)(n \cdot \omega)\, d\omega$. We then scale E by the surface albedo $\lambda$ to
find the radiosity I, which corresponds to the image intensity directly:

$I(p, n) = \lambda(p) E(n)$ (3)

Basri et al. [2] and Ramamoorthi [22] have independently shown that E can be
approximated by the combination of the first nine spherical harmonics H(x, y, z)
for Lambertian surfaces:




where the superscripts e and o denote the even and the odd components of the
harmonics respectively and x, y, z denote the Cartesian components. Then the
image intensity of a point p with surface normal $n = (n_x, n_y, n_z)$ and albedo $\lambda$
can be computed according to Eq. 3 by replacing x, y, z with $n_x, n_y, n_z$. Fig. 1
gives an example of the mean shape and texture of the morphable model under
a spherical harmonics representation.
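
As an illustration, the first nine real spherical harmonics can be evaluated at a unit surface normal as sketched below. The constants are the standard ones, written in the form used by Basri and Jacobs; the ordering of the nine terms and the function names are assumptions and may differ from the paper's Eq. 4.

```python
import math

def harmonic_basis(normal):
    """First nine real spherical harmonics at a unit normal (x, y, z)."""
    x, y, z = normal
    c0 = 1.0 / math.sqrt(4.0 * math.pi)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    c2 = 0.5 * math.sqrt(5.0 / (4.0 * math.pi))
    c3 = 3.0 * math.sqrt(5.0 / (12.0 * math.pi))
    return [
        c0,                              # order 0
        c1 * z, c1 * x, c1 * y,          # order 1: zonal, even, odd
        c2 * (3.0 * z * z - 1.0),        # order 2: zonal (assumes |n| = 1)
        c3 * x * z, c3 * y * z,          # order 2: even and odd, degree 1
        0.5 * c3 * (x * x - y * y),      # order 2: even, degree 2
        c3 * x * y,                      # order 2: odd, degree 2
    ]

def intensity(albedo, normal, light):
    """Image intensity: albedo times the lighting-weighted sum of harmonics."""
    return albedo * sum(l * h for l, h in zip(light, harmonic_basis(normal)))
```

Evaluating `harmonic_basis` at every vertex normal and multiplying by the vertex albedo gives the nine basis images; an image under a given lighting is then a weighted sum of them, with one illumination coefficient per harmonic.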
Fig. 1. The first image is the mean of the morphable model and the following nine
images are the basis images under various viewpoints, represented by spherical
harmonics. Lighter gray (0-127) represents positive values and darker gray (128-
255) represents negative values.



4 Face Model Reconstruction and Basis Image Rendering 

In this section, we will explain how we recover the shape and texture information 
of a training subject by combining a morphable model and spherical harmonics 
lighting representation. 



4.1 Forward and Inverse Face Rendering 

We can generate photo-realistic face images by using the morphable model
described in Section 2 [6]. Here we describe how we synthesize a new face image
from the face shape and texture vectors s and t; inverting this synthesis
process is how we recover shape and texture information from the image.
Shape: Similar to [23], a realistic face shape can be generated by:

$s_{2d} = f P R(\bar{s} + S\alpha + t_{3d}) + t_{2d}$ (5)

where $f$ is a scale parameter, $P$ an orthographic projection matrix and $R$ a
rotation matrix with $\phi$, $\gamma$ and $\theta$ the three rotation angles for the three axes.
$t_{3d}$ and $t_{2d}$ are translation vectors in 3D and 2D respectively. Eq. 5 relates the
vector of 2D image coordinates $s_{2d}$ and the shape parameters $\alpha$. For rendering,
a visibility test must still be performed by using a z-buffer method.
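
The forward projection of Eq. 5 can be sketched as follows for vertices that have already been generated from the morphable model. The composition order of the three axis rotations is an assumption (the text does not state it), the function names are hypothetical, and the visibility test is omitted.

```python
import math

def rotation_matrix(phi, gamma, theta):
    """R = Rz(theta) * Ry(gamma) * Rx(phi); this axis order is an assumption."""
    cx, sx = math.cos(phi), math.sin(phi)
    cy, sy = math.cos(gamma), math.sin(gamma)
    cz, sz = math.cos(theta), math.sin(theta)
    rx = [[1, 0, 0], [0, cx, -sx], [0, sx, cx]]
    ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]
    rz = [[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]

    return matmul(rz, matmul(ry, rx))

def project_vertices(vertices, f, angles, t3d, t2d):
    """Apply s_2d = f * P * R * (v + t3d) + t2d to each 3D vertex v."""
    r = rotation_matrix(*angles)
    out = []
    for v in vertices:
        p = [v[k] + t3d[k] for k in range(3)]
        q = [sum(r[i][k] * p[k] for k in range(3)) for i in range(3)]
        # The orthographic projection P keeps x and y and drops z.
        out.append((f * q[0] + t2d[0], f * q[1] + t2d[1]))
    return out
```

For fixed scale and rotation this map is linear in the 3D vertex positions, which is why the shape parameters can be updated by solving a linear system.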

For a training image, inverting the rendering process, the shape parameters
can be recovered from the shape error: if $f$, $\phi$, $\gamma$ and $\theta$ are kept constant, the
relation between the shape $s_{2d}$ and $\alpha$ is linear according to Eq. 5: $\partial s_{2d} / \partial \alpha = f P R S$.
Thus, updating $\alpha$ from a shape error $\delta s_{2d}$ requires only the solution of a linear
system of equations. In our method, the shape error is estimated by the
displacements of a set of manually picked feature points $s_f$ [14] corresponding to image
coordinates $s_f^{img}$. The shape reconstruction goes through the following steps:

Model Initialization: All the parameters are initialized in this step. Shape
parameter $\alpha$ is set to 0 and pose parameters $f$, $\phi$, $\gamma$, $\theta$ and $t_{2d}$ are initialized
manually. We do not need to know the illumination conditions of the training
image, unlike [23].

Feature Correspondence: For the set of pre-picked feature points in the
morphable model, we find the correspondence $s_f^{img}$ in the training image
semi-automatically. The set of feature points contains major and secondary features,
see Fig. 2. After the correspondences of major features are manually set, the
secondary features are updated automatically.

Rotation, Translation and Scale Parameters Update: the parameters $f$, $\phi$, $\gamma$
and $\theta$ can be recovered by using a Levenberg-Marquardt optimization to
minimize the error between $s_f^{img}$ and the model feature points [13]:

$\arg\min_{f, \phi, \gamma, \theta, t_{2d}} \| s_f^{img} - (f P R(\bar{s}_f + S_f \alpha + t_{3d}) + t_{2d}) \|^2 = (f, \phi, \gamma, \theta, t_{2d})$ (6)

where $\bar{s}_f$ and $S_f$ are the corresponding shape information of the feature points in
the morphable model in Eq. 2.

Shape Parameter Update: The shape error of the feature points, $\delta s_{2d}^f$, is 
defined as the difference between $s_f^{img}$ and the new shape information of the feature 
points in the model rendered with the recovered parameters $f$, $\phi$, $\gamma$, $\theta$ and 
$t_{2d}$. Thus, the vector of shape parameters $\alpha$ can be updated by solving a linear 
system of equations: 

$$\delta s_{2d}^f = f P R S_f \, \delta\alpha \qquad (7)$$
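
The linear update of Eq. 7 can be illustrated with a small least-squares sketch. The array layout (`S_f` stored as one 3 x m block per feature point) and the function name are assumptions made for this illustration only; the paper does not specify an implementation.

```python
import numpy as np

def update_shape_parameters(f, P, R, S_f, delta_s):
    """Solve delta_s = f P R S_f delta_alpha (Eq. 7) in the least-squares sense.

    f       : scalar scale
    P       : (2, 3) orthographic projection matrix
    R       : (3, 3) rotation matrix
    S_f     : (k, 3, m) shape basis restricted to k feature points (assumed layout)
    delta_s : (k, 2) 2D displacement of each feature point
    """
    M = f * (P @ R)                               # 2 x 3 map from 3D to the image plane
    J = np.einsum("ij,kjm->kim", M, S_f)          # per-point 2 x m blocks
    J = J.reshape(-1, S_f.shape[2])               # stack into a (2k, m) system matrix
    delta_alpha, *_ = np.linalg.lstsq(J, delta_s.reshape(-1), rcond=None)
    return delta_alpha
```

With more feature coordinates (2k) than shape parameters (m) the system is overdetermined, which is why a least-squares solve is used here rather than a direct inverse.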

Texture: For texture information recovery, most of the previous methods 
[11] [23] are applicable only to images taken under a single light source, which limits 
their applicability. Here we propose a method which performs texture fitting to 
a training image and places no restriction on the image illumination conditions. 

According to Eq. 3 and 4, the texture of a face can be generated by: 

$$t = B l, \qquad B = H(n_x, n_y, n_z) \cdot A \qquad (8)$$

where $H$ is the spherical harmonics representation of the reflectance function 
(Eq. 4) and $l$ is the vector of illumination coefficients. Hence, if we know the 
illumination coefficients, the texture information is only dependent on image 
intensity $t$ and surface normal $n$, which can be computed from the 3D shape 
we recovered during the shape fitting step. The texture recovery proceeds as 
follows: 

Basis Computation: The initial albedo $A$ for each vertex is set to $t$. With 
the recovered shape information, we first compute the surface normal $n$ for each 
vertex. Then the first nine basis images $B$ and spherical harmonics $H(n)$ for the 
reflectance function can be computed according to Eq. 8 and 4 respectively. 

Fig. 2. Recovery Results: Images in the first row are the input training images, 
those in the second row are the initial fittings, the third row shows images of 
the recovered 3D face model and the last row gives the illuminated rotated face 
models. In the first column, the black points are pre-picked major features, the 
white points are the corresponding features and the points lying in the white 
line are secondary features. 



Illumination Coefficients Estimation: The set of illumination coefficients $l$ is 
updated by solving a linear system of equations: 

$$t_{tra} = B_{cur} \, l \qquad (9)$$

Texture Recovery: According to Eq. 8, the texture $A$ for each visible vertex 
is computed by solving $t_{tra} = H(n_x, n_y, n_z) \, l \cdot A$. Since texture is dependent on 
both current texture and illumination coefficients, the new value of $A$ is: 

$$A = (1 - \eta) A_{cur} + \eta \, \bigl(t_{tra} / (H(n_x, n_y, n_z) \, l)\bigr) \qquad (10)$$

We compute $l$ and $A$ by solving Eq. 9 and 10 iteratively. In our experiments, the 
weight $\eta$ is first set to 0.5, then incremented by 0.1 at each step until it reaches 
1. Instead of recovering texture parameters [23], we estimate the albedo value 
for each vertex, which will be used for basis image rendering and recognition. 
For occluded vertices, texture information is estimated through facial symmetry. 
Fig. 2 shows the results of our method. 
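
The alternation between Eq. 9 and Eq. 10 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the harmonic values `H` (one row of nine values per visible vertex) are already available, initializes the albedo from the observed intensities, and guards against division by very small shading values with a hypothetical `eps` parameter.

```python
import numpy as np

def recover_albedo(H, t_tra, etas=None, eps=1e-8):
    """Alternate Eq. 9 (illumination) and Eq. 10 (albedo) for visible vertices.

    H     : (V, 9) spherical harmonic values at each vertex normal (assumed given)
    t_tra : (V,) observed training-image intensity at each vertex
    etas  : blending weights; defaults to 0.5, 0.6, ..., 1.0 as in the text
    """
    if etas is None:
        etas = np.linspace(0.5, 1.0, 6)
    A = np.asarray(t_tra, dtype=float).copy()      # assumed initial albedo
    l = np.zeros(H.shape[1])
    for eta in etas:
        B_cur = H * A[:, None]                     # current basis images (Eq. 8)
        l, *_ = np.linalg.lstsq(B_cur, t_tra, rcond=None)   # Eq. 9
        shading = H @ l
        shading = np.where(np.abs(shading) < eps, eps, shading)
        A = (1.0 - eta) * A + eta * (t_tra / shading)       # Eq. 10
    return A, l
```

Note that albedo and illumination are only determined up to a common scale factor, so the sketch reproduces the observed intensities rather than a unique albedo.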

Our shape fitting involves human interaction, since we manually find 
the correspondences for major features. Automatic shape fitting of a morphable 
model [23] is beyond the scope of this paper, which focuses on the statistics of the 
interaction of geometry with arbitrary unknown illumination; the feature-based 
method performed sufficiently well to demonstrate the strength of our 
approach. 

4.2 Basis Images Rendering 

For each training subject, we recover a 3D face model using the algorithm 
described in Section 4.1. The recovered face models serve as generative models to 
render basis images. In this section, for each subject, a set of basis images across 
poses is generated, to be used during recognition. We sample the pose 
every 5° along both the azimuth and altitude axes. In our experiments, the range of 
azimuth is [-70,70] and the range of altitude is [-10,10]. Fig. 3 shows a subset of 
the basis images for one subject. 
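
As a quick check of the sampling described above, the pose grid can be enumerated directly. The variable names are illustrative only.

```python
import numpy as np

# Pose samples every 5 degrees over the ranges given in the text.
azimuths = np.arange(-70, 71, 5)     # -70, -65, ..., 70
altitudes = np.arange(-10, 11, 5)    # -10, -5, 0, 5, 10

# One set of basis images is rendered per (azimuth, altitude) pair.
poses = [(int(az), int(al)) for az in azimuths for al in altitudes]
```

This gives 29 azimuth values and 5 altitude values, that is, 145 poses per subject.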

5 Face Recognition 

In the basis image rendering step, for each subject $i$, a set of 145 (29*5) sets of basis 
images $B_i^j$, $j \in [1..145]$, is rendered. During testing, given a new test image $I_t$, we 
recognize the face of the subject $i$ for which there exists a weighted combination 
of basis images that is the closest to the test face image [2]: $\min_{i,j} \|B_i^j l - I_t\|$, 
where $B_i^j$ is a set of basis images with size $d \times r$, $d$ is the number of points in 
the image and $r$ the number of basis images used (9 is a natural choice; we also 
tried 4 in our experiments). Every column of $B_i^j$ contains one spherical harmonic 
image, and the columns of $B_i^j$ form a basis for the linear subspace. To solve the 
equation, we simply apply QR decomposition to $B_i^j$ to obtain an orthonormal 
basis $Q$. Thus, we compute the distance from the test image $I_t$ to the space 
spanned by $B_i^j$ as $\|QQ^T I_t - I_t\|$. 
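
The subspace distance above can be sketched in a few lines of NumPy. The dictionary keyed by (subject, pose) and the function names are assumptions made for the illustration.

```python
import numpy as np

def subspace_distance(B, image):
    """Distance ||Q Q^T I_t - I_t|| from an image to the column space of B.

    B     : (d, r) matrix whose columns are basis images
    image : (d,) test image as a vector
    """
    Q, _ = np.linalg.qr(B)            # reduced QR: Q is (d, r) with orthonormal columns
    return np.linalg.norm(Q @ (Q.T @ image) - image)

def recognize(bases, image):
    """Return the (subject, pose) key whose basis set is closest to the image.

    bases : dict mapping a (subject, pose) key to its (d, r) basis matrix
    """
    return min(bases, key=lambda key: subspace_distance(bases[key], image))
```

In practice the Q factors would be computed once per basis set and cached, since they do not depend on the test image.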

6 Experiments and Results 

In our experiments, we used the CMU-PIE database, which provides images 
with both pose and illumination variation. The CMU-PIE database contains 68 
individuals, none of whom is also in the USF set used to compute the morphable 
model. We performed experiments on a set of 4488 images which contains 68 
subjects, 3 poses for each subject and 22 different illuminations for each pose. 

Fig. 3. A subset of the rendered basis images across poses. 



6.1 Experiments of Illumination Invariance 

Since our recognition method is based on the 3D face models recovered during 
training, it is important that the recovered face models and rendered basis images 
are robust. Figure 4 shows three sets of rendered basis images recovered from 
various face images under different illuminations for one subject. The resulting 
basis images rendered from images under different illumination are very close. 
For each subject, we calculated 10 sets of basis images using 10 training images under 
different illumination. The per-person mean variance of the 10 resulting sets 
of basis images was 3.32. For comparison, the per-person variance of the original 
training images was 20.25. That means the rendered basis images have much 
greater invariance to illumination effects than the original images. 



6.2 Recognition Experiments 

In the recognition experiments, we used the same set of 4488 images in CMU-PIE. 
We used only one image per subject to recover the 3D face model. We used the 
front and side galleries for training and all three pose galleries for testing. Notice 
that training images can have very different illumination conditions (unlike [23]). 
We performed recognition using either all 9 basis images or only the first 4 
basis images. We report our experimental results and a comparison to [23] in Table 
1. The experimental results show that our method gives good recognition 
rates. When the poses of training and testing images are very different, our 
method is not as good as [23] because we only used a set of feature points to 
recover the shape information and the shape recovery is not accurate enough. 

Table 1. Recognition results and comparison: The first column lists the light 
numbers and the following two columns list the recognition rate for each pose. 
The recognition rates of the LiST algorithm are taken from [23]. 

         Front Gallery          Front Gallery          Side Gallery           Side Gallery
         Using all 9 basis      Using first 4 basis    Using all 9 basis      Using first 4 basis
Light    Front  Side  Profile   Front  Side  Profile   Front  Side  Profile   Front  Side  Profile
1        95     89    51        89     81    49        91     92    52        79     78    52
2        89     81    34        79     73    31        80     83    34        67     67    33
… 97 88 44 89 79 … 92 96 … 83 86 48 

Fig. 4. Rendered basis images from training images taken under different 
illumination conditions. The first column shows the training images. 

Table 2. Recognition results using various previous methods and our method 
on Yale Database B. Except for our method, the data were taken from [34]. 

Methods           Subset 1,2   Subset 3   Subset 4
Eigenfaces        100          74.2       24.3
Linear Subspace   100          100        85
Cones-attached    100          100        91.4
9PL               100          100        97.2
Cones-cast        100          100        100
2D HIE            100          99.7       96.9
Our Method        100          100        97.2



We also performed experiments on the Yale Face Database B [3] and compared 
our recognition results with other methods for fixed frontal pose. The 
Yale Database contains images of 10 people with 9 poses and 64 illuminations 
per pose. We used 45*10 frontal face images for 10 subjects, with each subject 
having 45 face images taken under different directional light sources. The data 
set is divided into 4 subsets following [15]. Table 2 compares our recognition rates 
with previous methods. As can be seen from Table 2, the results from our method 
are comparable with methods that require extensive training data per subject, 
even though our method requires only one training image per pose. For fixed 
pose, the 2D HIE method in [34] performs almost as well as our method; 
however, its performance is very sensitive to accurate alignment of the faces. 

6.3 Multiple Directional Illumination 

As we mentioned, most of the previous methods are only applicable to single 
directional lighting. We study the performance of our method on images taken 
under multiple directional illumination sources to test our method under arbitrary 
illumination. We synthesized images by combining face images in our data 
set and performed experiments on the front and side galleries. For each subject, we 
randomly selected 2-6 images from the training data set and combined them 
with random weights to simulate face images under multiple directional 
illumination sources (16 images per subject). We ran experiments with the 
synthesized images in both the training step and the testing step. Table 3 shows the 
experimental results; our method performed equally well 
under multiple sources of arbitrary direction. 
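
The synthesis step can be sketched as a random weighted sum of single-light images, relying on the fact that image formation is linear in the light sources. Normalizing the weights to sum to one is an assumption of this sketch (the text only says the weights are random), as are the function and parameter names.

```python
import numpy as np

def synthesize_multi_light(images, rng, min_k=2, max_k=6):
    """Combine 2-6 single-light images of one subject with random weights.

    images : (n, h, w) array of images of the same subject and pose,
             each taken under one directional light (requires n >= max_k)
    rng    : numpy.random.Generator
    """
    k = int(rng.integers(min_k, max_k + 1))             # number of lights to combine
    idx = rng.choice(len(images), size=k, replace=False)
    weights = rng.random(k)
    weights /= weights.sum()                             # assumed normalization
    return np.tensordot(weights, images[idx], axes=1)    # weighted sum, shape (h, w)
```

Calling the function 16 times per subject would reproduce the set size mentioned in the text.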

7 Conclusions and Future Work 

We have shown that by combining a morphable model and spherical harmonic 
lighting representation, we can recover both shape and texture information from 
one single image taken under arbitrary illumination conditions. Experimental 
results indicate that our method's recognition rates are comparable to other 

Table 3. Experimental results of images under multiple directional illumination. 
"s" denotes images under single directional lighting and "m" denotes synthesized 
images under multiple illumination. "F" denotes the front gallery and "D" denotes 
the side gallery. 

                    Train: s; Test: s   Train: m; Test: s   Train: s; Test: m   Train: m; Test: m
Train: F; Test: F   98.2                98.3                97.8                98.1
Train: F; Test: D   92.4                92.0                91.5                92.2
Train: D; Test: F   94.4                93.6                94.2                94.8
Train: D; Test: D   96.7                95.9                96.3                96.1

methods for pose variant images under single illumination. Moreover our method 
performs as well in the case of multiple illuminants, which is not handled by 
most previous methods. During the training phase, we only need one image 
per subject without illumination and pose limitations to recover the shape and 
texture information. Thus, the training set can be expanded easily with new 
subjects, which is desirable in a Face Recognition System. 

In our experiments, we tested images under both single and multiple directional 
illumination. At this time, there exist relatively few publicly available sets 
of images of faces under arbitrary illumination conditions, so we plan to continue 
validation of our method with a database with more types of light sources, e.g. 
area sources. The initialization of the model and the feature correspondences 
currently require human interaction. We plan to integrate head pose estimation methods 
[32] for model initialization and optical flow algorithms for shape error estimation. 
In the face recognition phase, our method needs to search the whole pose 
space; we expect a great speed-up from a pre-filter process (again using face pose 
estimation algorithms) that narrows the search space. 

Abstract. The performance of face authentication systems has steadily 
improved over the last few years. State-of-the-art methods use the projection 
of the gray-scale face image into a Linear Discriminant subspace 
as input of a classifier such as Support Vector Machines or Multi-layer 
Perceptrons. Unfortunately, these classifiers involve thousands of parameters 
that are difficult to store on a smart-card, for instance. Recently, 
boosting algorithms have emerged to boost the performance of 
simple (weak) classifiers by combining them iteratively. The famous AdaBoost 
algorithm has been proposed for object detection and applied 
successfully to face detection. In this paper, we investigate the use of 
AdaBoost for face authentication to boost weak classifiers based simply 
on pixel values. The proposed approach is tested on a benchmark 
database, namely XM2VTS. Results show that boosting only hundreds 
of classifiers achieves near state-of-the-art results. Furthermore, the proposed 
approach outperforms similar work on face authentication using 
boosting algorithms on the same database. 



1 Introduction 

Identity authentication is a general task that has many real-life applications such 
as access control, transaction authentication (in telephone banking or remote 
credit card purchases for instance), voice mail, or secure teleworking. 

The goal of an automatic identity authentication system is to either accept 
or reject the identity claim made by a given person. Biometric identity authentication 
systems are based on the characteristics of a person, such as his or her face, 
fingerprint or signature. A good introduction to identity authentication can be 
found in [1]. Identity authentication using face information is a challenging research 
area that has been very active recently, mainly because of its natural and 
non-intrusive interaction with the authentication system. 

The paper is structured as follows. In Section 2 we first introduce the reader 
to the problem of face authentication. 
Then, we present the proposed approach, boosting pixel-based classifiers for 
face authentication. We then compare our approach to state-of-the-art results on the 
benchmark database XM2VTS. Finally, we analyze the results and conclude. 

2 Face Authentication 

2.1 Problem Description 

An identity authentication system has to deal with two kinds of events: either 
the person claiming a given identity is the one who he claims to be (in which 
case, he is called a client), or he is not (in which case, he is called an impostor). 
Moreover, the system may generally take two decisions: either accept the client 
or reject him and decide he is an impostor. 

The classical face authentication process can be decomposed into several 
steps, namely image acquisition (grab the images, from a camera or a VCR, 
in color or gray levels), image processing (apply filtering algorithms in order to 
enhance important features and to reduce the noise), face detection (detect and 
localize a possible face in a given image) and finally face authentication itself, 
which consists in verifying whether the given face corresponds to the claimed identity 
of the client. 

In this paper, we assume (as it is often done in comparable studies, but 
nonetheless incorrectly) that the detection step has been performed perfectly 
and we thus concentrate on the last step, namely the face authentication step. 
The problem of face authentication has been addressed by different researchers 
and with different methods. For a complete survey and comparison of different 
approaches see [2]. 

2.2 State-of-the-Art Methods 

The representation used to code input images in most state-of-the-art methods 
is often based on the gray-scale face image [3, 4] or its projection into a Principal 
Component subspace or Linear Discriminant subspace [5, 6]. In this section, we 
briefly introduce one of the best methods [5]. 

Principal Component Analysis (PCA) identifies the subspace defined by the 
eigenvectors of the covariance matrix of the training data. The projection of face 
images into the coordinate system of eigenvectors (Eigenfaces) [7] associated 
with nonzero eigenvalues achieves information compression, decorrelation and 
dimensionality reduction to facilitate decision making. A Linear Discriminant 
is a simple linear projection $y = b + w \cdot x$ of the input vector onto an output 
dimension, where the estimated output $y$ is a function of the input vector $x$, 
and the parameters $\{b, w\}$ are chosen according to a given criterion such as the 
Fisher criterion [8]. The Fisher criterion aims at maximizing the ratio of between-class 
scatter to within-class scatter. Finally, the Fisher Linear Discriminant subspace 
holds more discriminant features for classification [9] than the PCA subspace. 

In [5], the projection of a face image into the system of Fisher-faces yields 
a representation which emphasizes the discriminatory content of the image. 
The main decision tool is Support Vector Machines (SVMs). 

The above approach involves thousands of parameters that are difficult to 
store on a smart-card, for instance. New approaches should be investigated to build 
classifiers using only hundreds of parameters. Recently, boosting algorithms have 
emerged to boost the performance of simple (weak) classifiers by combining them 
iteratively. The famous AdaBoost algorithm has been proposed for object 
detection [10] and applied successfully to face detection [11]. AdaBoost has 
also been applied to face authentication [12] to boost classifiers based on Haar-like 
features (Fig. 1) as described in [11]. Unfortunately, this boosting approach 
has obtained results far from the state-of-the-art. 






Fig. 1. Five types of Haar-like features. 



3 The Proposed Approach 

In face authentication, we are interested in particular objects, namely faces. The 
representation used to code input images in most state-of-the-art methods is 
often based on the gray-scale face image. Thus, we propose to use AdaBoost to boost 
weak classifiers based simply on pixel values. 

3.1 Feature Extraction 

In a real application, the face bounding box will be provided by an accurate 
face detector [13, 14] but here the bounding box is computed using manually 
located eyes coordinates, assuming a perfect face detection. In this paper, the 
face bounding box is determined using face/head anthropometry measures [15] 
according to a face model (Fig. 2). 

The face bounding box w/h crops the physiognomical height of the face. 
The width w of the face is given by zy_zy/s, where s = 2·pupil_se/x_ee and 
x_ee is the distance between the eyes in pixels. In this model, the ratio w/h is 
equal to the ratio 15/20. Thus, the height h of the face is given by w·20/15 and 
y_upper = h·(tr_gn - en_gn) / tr_gn. The constants pupil_se (pupil-facial 
middle distance), en_gn (lower half of the craniofacial height), tr_gn (height of 
the face), and zy_zy (width of the face) can be found in [15]. 
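
The bounding-box formulas above can be written out as a short sketch. The anthropometric constants are passed in rather than hard-coded, since their values come from [15]; placing the box horizontally around the eye midpoint and using y_upper as the offset from the eye line to the top of the box are assumptions of this illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Anthropometry:
    """Anthropometric constants (values to be taken from [15])."""
    pupil_se: float   # pupil-facial middle distance
    en_gn: float      # lower half of the craniofacial height
    tr_gn: float      # height of the face
    zy_zy: float      # width of the face

def face_bounding_box(left_eye, right_eye, m):
    """Return (x_left, y_top, w, h) of the face box from the two eye centers."""
    x_ee = math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])
    s = 2.0 * m.pupil_se / x_ee          # model units per pixel
    w = m.zy_zy / s                      # face width in pixels
    h = w * 20.0 / 15.0                  # fixed aspect ratio w/h = 15/20
    y_upper = h * (m.tr_gn - m.en_gn) / m.tr_gn
    cx = (left_eye[0] + right_eye[0]) / 2.0
    cy = (left_eye[1] + right_eye[1]) / 2.0
    return cx - w / 2.0, cy - y_upper, w, h   # assumed placement of the box
```

The resulting crop always has the 15/20 aspect ratio, so it can be downsized to 15x20 pixels without distortion.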

Fig. 2. Face modeling and pre-processing. On the left: the face modeling using 
eyes center coordinates and facial anthropometry measures. On top-right: the 
original face image. On the bottom-right: the pre-processed face image. 

The extracted face is downsized to a 15x20 image. Then, we perform histogram 
normalization to modify the contrast of the image in order to enhance 
important features. Finally, we smooth the enhanced image by convolving it with a 
3x3 Gaussian ($\sigma = 0.25$) in order to reduce the noise. After enhancement and 
smoothing (Fig. 2), the face image becomes a feature vector of dimension 300. 
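
A minimal sketch of this pre-processing chain is given below. It assumes the face has already been cropped and downsized to 20 rows by 15 columns, and it interprets "histogram normalization" as histogram equalization, which is an assumption; the function names are illustrative.

```python
import numpy as np

def gaussian_kernel_3x3(sigma=0.25):
    """3x3 Gaussian kernel normalized to sum to one."""
    ax = np.array([-1.0, 0.0, 1.0])
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def equalize_histogram(img):
    """Histogram equalization of a 2D uint8 image (assumed normalization)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]
    return cdf[img] * 255.0

def smooth(img, kernel):
    """Convolve a 2D image with a 3x3 kernel, replicating border pixels."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def face_to_feature_vector(face):
    """Map a 20x15 uint8 face crop to a 300-dimensional feature vector."""
    enhanced = equalize_histogram(face)
    smoothed = smooth(enhanced, gaussian_kernel_3x3(0.25))
    return smoothed.ravel()
```

With $\sigma = 0.25$ almost all of the kernel weight sits on the center pixel, so the smoothing is very mild.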



3.2 Boosting Weak Classifiers 

Introduction A complete introduction to the theoretical basis of boosting and 
its applications can be found in [16]. The underlying idea of boosting is to linearly 
combine simple weak classifiers $h_i(x)$ to build a strong ensemble $f(x)$: 

$$f(x) = \sum_{i=1}^{n} \alpha_i h_i(x)$$

Both the coefficients $\alpha_i$ and the hypotheses $h_i(x)$ are learned by the boosting 
algorithm. Each classifier $h_i(x)$ aims to minimize the training error on a particular 
distribution of the training examples. 

At each iteration (i.e. for each weak classifier), the boosting procedure mod- 
ifies the weight of each pattern in such a way that the misclassified samples get 
more weight in the next iteration. Boosting hence focuses on the examples that 
are hard to classify. 

AdaBoost [17] is the best-known boosting procedure. It has been used 
in numerous empirical studies and has received considerable attention from 
the machine learning community in recent years. Freund et al. [17] showed 
two interesting properties of AdaBoost. First, the training error goes down 
exponentially to zero as the number of classifiers grows. Second, AdaBoost still 
learns after the training error reaches zero. Regarding the last point, Schapire 
et al. [18] showed that AdaBoost not only classifies samples correctly, but also 
computes hypotheses with large margins. The margin of an example is defined as 
its signed distance to the hyperplane times its label. A positive margin means 
that the example is well classified. This observation has motivated the search for 
boosting procedures which maximize the margin [19, 20]. It has been shown that 
maximizing the margin minimizes the generalization error [18]. 



Boosting Pixel-Based Weak Classifiers We choose to boost weak classifiers 
based simply on pixel values, as described in [10] for face detection. The weak 
classifier $h_i$ to boost is given by: 

$$h_i(x) = \begin{cases} 1 & : x_{f_i} \le \theta_i \\ 0 & : x_{f_i} > \theta_i \end{cases}$$

where $x$ is the given input image, $f_i$ is the index of the pixel to test in the image 
$x$ and $\theta_i$ is a threshold. AdaBoost estimates iteratively the best feature $\{f_i, \theta_i\}$ 
for $1 \le i \le 300$. 
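
The following is a minimal sketch of discrete AdaBoost over pixel-threshold weak classifiers, not the Torch-based implementation used in the paper. It departs from the definition above in two labeled ways: weak outputs are mapped from {1, 0} to {+1, -1}, and a polarity flag is added so that a stump can also fire when the pixel is above the threshold. The exhaustive threshold search is likewise an assumption.

```python
import numpy as np

def best_stump(X, y, w):
    """Pick (error, pixel index, threshold, polarity) with least weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for theta in np.unique(X[:, f]):
            pred = np.where(X[:, f] <= theta, 1, -1)
            for polarity in (1, -1):             # assumed extension of the stump
                err = np.sum(w[polarity * pred != y])
                if err < best[0]:
                    best = (err, f, theta, polarity)
    return best

def train_adaboost(X, y, n_rounds):
    """X: (n, 300) pixel vectors, y: labels in {+1 (client), -1 (impostor)}."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                      # uniform initial distribution
    model = []
    for _ in range(n_rounds):
        err, f, theta, polarity = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1.0 - 1e-10)   # avoid log(0) for perfect stumps
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = polarity * np.where(X[:, f] <= theta, 1, -1)
        w *= np.exp(-alpha * y * pred)           # misclassified samples gain weight
        w /= w.sum()
        model.append((alpha, f, theta, polarity))
    return model

def score(model, X):
    """Weighted vote f(x) = sum_i alpha_i h_i(x)."""
    total = np.zeros(X.shape[0])
    for alpha, f, theta, polarity in model:
        total += alpha * polarity * np.where(X[:, f] <= theta, 1, -1)
    return total

def predict(model, X):
    return np.where(score(model, X) >= 0, 1, -1)
```

Each trained weak classifier is described by only a pixel index, a threshold, a polarity and a weight, which illustrates why a model with a few hundred classifiers is compact enough for constrained storage.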



4 The XM2VTS Database and Protocol 

The XM2VTS database contains synchronized image and speech data recorded 
on 295 subjects during four sessions taken at one month intervals. The 295 
subjects were divided, according to the Lausanne Protocol [21], into a set of 200 
clients, 25 evaluation impostors, and 70 test impostors. Two different evaluation 
configurations were defined. They differ in the distribution of client training and 
client evaluation data. Both the training client and evaluation client data were 
drawn from the same recording sessions for Configuration I (LP1) which might 
lead to biased estimation on the evaluation set and hence poor performance on 
the test set. For Configuration II (LP2) on the other hand, the evaluation client 
and test client sets are drawn from different recording sessions which might lead 
to more realistic results. This led to the following statistics: 

- Training client accesses: 3 for LP1 and 4 for LP2 

- Evaluation client accesses: 600 for LP1 and 400 for LP2 

- Evaluation impostor accesses: 40,000 (25 * 8 * 200) 

- Test client accesses: 400 (200 * 2) 

- Test impostor accesses: 112,000 (70 * 8 * 200) 

Thus, the system may make two types of errors: false acceptances (FA), when 
the system accepts an impostor, and false rejections (FR), when the system 
rejects a client. In order to be independent of the specific dataset distribution, 
the performance of the system is often measured in terms of these two different 
errors, as follows: 

$$\mathrm{FAR} = \frac{\text{number of FAs}}{\text{number of impostor accesses}} \qquad (1)$$

$$\mathrm{FRR} = \frac{\text{number of FRs}}{\text{number of client accesses}} \qquad (2)$$

A unique measure often used combines these two ratios into the so-called 
Half Total Error Rate (HTER) as follows: 

$$\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2} \qquad (3)$$



Most authentication systems output a score for each access. Selecting a 
threshold over which scores are considered genuine clients instead of impostors 
can greatly modify the relative performance of FAR and FRR. A typical thresh- 
old chosen is the one that reaches the Equal Error Rate (EER) where FAR=FRR 
on a separate validation set. 
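
The error measures and the threshold selection can be sketched as follows. The convention that an access is accepted when its score is greater than or equal to the threshold, and the search over observed scores as candidate thresholds, are assumptions of this illustration.

```python
import numpy as np

def far_frr(client_scores, impostor_scores, threshold):
    """FAR and FRR (Eq. 1 and 2) when accesses with score >= threshold are accepted."""
    client_scores = np.asarray(client_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    far = float(np.mean(impostor_scores >= threshold))   # accepted impostors
    frr = float(np.mean(client_scores < threshold))      # rejected clients
    return far, frr

def hter(client_scores, impostor_scores, threshold):
    """Half Total Error Rate (Eq. 3)."""
    far, frr = far_frr(client_scores, impostor_scores, threshold)
    return (far + frr) / 2.0

def eer_threshold(client_scores, impostor_scores):
    """Threshold on validation scores where FAR and FRR are closest to equal."""
    candidates = np.unique(np.concatenate([client_scores, impostor_scores]))
    gaps = [abs(np.subtract(*far_frr(client_scores, impostor_scores, t)))
            for t in candidates]
    return float(candidates[int(np.argmin(gaps))])
```

In the protocol described above, `eer_threshold` would be applied to the evaluation set and the resulting threshold then used to report FAR, FRR and HTER on the test set.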



5 Experimental Results 

In this section, we provide experimental results¹ obtained by our approach, pixel-based 
boosted weak classifiers, on Configuration I of the Lausanne Protocol. 
We compare the results obtained to the state-of-the-art and to similar work using 
AdaBoost. 




Face feature vector 
of dimension 300 (15x20) 



Fig. 3. Pixel-based boosted classifier for face authentication. 



For each client, three shots are available. Each shot was slightly shifted, scaled 
and mirrored to obtain 220 examples. 2x220 patterns were used for training the 
client model and 1x220 patterns were used as a validation set to evaluate a 
threshold decision. The negative samples (pseudo-impostors) were generated by 
taking the three original shots of all other clients ((200-1) clients x 3 shots = 
1194 patterns). A model has been trained for each client. 

In Table 1, we provide the results obtained by our boosting approach (AdaPix) 
using different numbers of classifiers (50, 100, 150, 200). We also provide 
results obtained by a state-of-the-art approach, namely Normalized Correlation 
(NC) [6], and results obtained using boosted classifiers based on seven Haar-like 
features [12] (AdaHaar7). 



1 The machine learning library used for all experiments is Torch http://www.torch.ch. 

Table 1. Comparative results in terms of FAR/FRR and HTER for LP1 

Model               FAR    FRR    HTER
NC [6]              3.46   2.75   3.1
AdaHaar7 200 [12]   6.9    8.8    7.85
AdaPix 50           3.34   4.0    3.67
AdaPix 100          3.16   3.5    3.33
AdaPix 150          3.11   3.5    3.30
AdaPix 200          2.75   3.0    2.87
AdaHaar3 100        2.29   5.0    3.64



These results show that the performance of AdaPix increases 
with the number of classifiers, and that it is comparable 
to the state-of-the-art (NC). AdaPix outperforms AdaHaar7 with 
fewer classifiers. Furthermore, AdaHaar7 obtained results far from the state-of-the-art. 
As a fair comparison, we used our AdaBoost algorithm to boost weak 
classifiers for the first three types (Fig. 1) of Haar-like features (AdaHaar3), and 
we obtained an HTER two times smaller than AdaHaar7 with half as many 
classifiers. 

6 Conclusion 

In this paper, we proposed the use of AdaBoost for face authentication to boost 
weak classifiers based simply on pixel values. The proposed approach was tested 
on a benchmark database, namely XM2VTS, using its associated protocol. Results 
have shown that boosting only hundreds of classifiers achieved near state-of-the-art 
results. Furthermore, the proposed approach outperforms similar work on 
face authentication using boosting algorithms on the same database. 

Boosting algorithms will certainly be used more and more often in face 
authentication. A probable new direction is to combine the efficiency of boosting 
algorithms with discriminant features such as LDA. 

Abstract. The null space of the within-class scatter matrix is found to express 
the most discriminative information for the small sample size problem (SSSP). The 
null space-based LDA takes full advantage of the null space while the other 
methods remove the null space. It proves to be optimal in performance. From 
the theoretical analysis, we present the NLDA algorithm and the most suitable 
situation for NLDA. Our method is simpler than all other null space approaches; 
it saves computational cost while maintaining performance. 
Furthermore, the kernel technique is incorporated into discriminant analysis in the 
null space. Firstly, all samples are mapped to the kernel space through a better 
kernel function, called the Cosine kernel, which is proposed to increase the 
discriminating capability of the original polynomial kernel function. Secondly, a 
truncated NLDA is employed. The novel approach only requires one eigenvalue 
analysis and is also applicable to the large sample size problem. Experiments 
are carried out on different face data sets to demonstrate the effectiveness 
of the proposed methods. 



1 Introduction 

Linear Discriminant Analysis (LDA) has been successfully applied to face recogni- 
tion. The objective of LDA is to seek a linear projection from the image space onto a 
low dimensional space by maximizing the between-class scatter and minimizing the 
within-class scatter simultaneously. Belhumeur [1] compared Fisherface with Eigen- 
face on the HARVARD and YALE face databases, and showed that LDA was better 
than PCA, especially under illumination variation. LDA was also evaluated favorably 
under the FERET testing framework [2], [7]. 

In many practical face recognition tasks, there are not enough samples to make the 
within-class scatter matrix $S_w$ nonsingular; this is called the small sample size problem. 
Different solutions have been proposed to deal with it when using LDA for face 
recognition [1]-[6]. 

The most widely used methods (Fisherface) [1, 2, 3] first apply PCA to reduce 
the dimension of the samples to an intermediate dimension, which must be guaranteed 
to be no more than the rank of $S_w$, so as to obtain a full-rank within-class scatter 
matrix. Then standard LDA is used to extract and represent facial features. All of these 
methods ignore the importance of the null space of the within-class scatter 
matrix, and remove the null space to make the resulting within-class scatter full-rank. 

Yang et al. [4] proposed a new algorithm which incorporates the concept of null 
space. It first removes the null space of the between-class scatter matrix $S_b$ and seeks a 
projection to minimize the within-class scatter (called Direct LDA / DLDA). Because 
the rank of $S_b$ is smaller than that of $S_w$, removing the null space of $S_b$ may lose part 
of or the entire null space of $S_w$, which is very likely to be full-rank after the 
removing operation. 

Chen et al. [5] proposed a more straightforward method that makes use of the null 
space of $S_w$. The basic idea is to project all the samples onto the null space of $S_w$, 
where the resulting within-class scatter is zero, and then maximize the between-class 
scatter. This method involves an eigenvalue computation in a very large dimension, since 
$S_w$ is an $n \times n$ matrix. To avoid the great computational cost, a pixel grouping method is 
used in advance to artificially extract features and to reduce the dimension of the 
original samples. 
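
The basic idea described above (project onto the null space of $S_w$, then maximize the between-class scatter there) can be sketched naively as follows. This illustration forms the full n x n scatter matrices and uses a dense eigendecomposition, so it only suits small dimensions; it is not the algorithm of [5] or of this paper, and the tolerance parameter `tol` is an assumption.

```python
import numpy as np

def null_space_lda(X, y, tol=1e-8):
    """Return projection vectors lying in the null space of S_w.

    X : (N, n) samples as rows, with N much smaller than n
    y : (N,) integer class labels
    """
    classes = np.unique(y)
    mean = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        D = Xc - mc
        Sw += D.T @ D                               # within-class scatter
        diff = (mc - mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)         # between-class scatter
    evals, evecs = np.linalg.eigh(Sw)
    N = evecs[:, evals < tol * max(evals.max(), 1.0)]   # basis of null(S_w)
    bvals, bvecs = np.linalg.eigh(N.T @ Sb @ N)     # between-class scatter in null(S_w)
    order = np.argsort(bvals)[::-1][:len(classes) - 1]
    return N @ bvecs[:, order]                      # (n, C-1) discriminant vectors
```

After projection, all training samples of the same class collapse to a single point, which is the zero within-class scatter property mentioned in the text.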

Huang et al. [6] introduced a more efficient null space approach. The basic notion 
behind the algorithm is that the null space of $S_w$ is particularly useful in discriminating 
ability, whereas that of $S_b$ is useless. They proved that the null space of the total 
scatter matrix $S_t$ is the common null space of both $S_w$ and $S_b$. Hence the algorithm 
first removes the null space of $S_t$ and projects the samples onto the null space of $S_w$. 
Then it removes the null space of the between-class scatter in the subspace to get the 
optimal discriminant vectors. 

Although null space-based LDA seems to be more efficient than other linear sub- 
space analysis methods for face recognition, it is still a linear technique in nature. 
Hence it is inadequate to describe the complexity of real face images because of illu- 
mination, facial expression and pose variations. The kernel technique has been exten- 
sively demonstrated to be capable of efficiently representing complex nonlinear rela- 
tions of the input data. Kernel Fisher Discriminant Analysis [8, 9, 10] (KFDA) is an 
efficient nonlinear subspace analysis method, which combines the kernel technique 
with LDA. After the input data are mapped into an implicit feature space, LDA is 
performed to yield nonlinear discriminating features of the input data. 

In this paper, some elements of state-of-the-art null space techniques are
examined in more depth, and our null space approach is proposed to reduce the
computational cost while maintaining performance. Furthermore, we combine the
advantages of the null space approach and the kernel technique. A kernel mapping
based on an efficient kernel function, called the Cosine kernel, is first applied
to all the samples. In kernel space, the total scatter matrix is full-rank, so the
procedure of the null space approach is greatly simplified and more numerically
stable.

The paper is laid out as follows. In Section 2, related work on LDA-based
algorithms is reviewed. In Section 3, our null space method (NLDA) is presented. In
Section 4, null space-based KFDA (NKFDA) is proposed, and experiments are reported
in Section 5. Finally, Section 6 concludes the paper.



2 Previous Work

We first give some assumptions and definitions. Let n denote the dimension of the
original sample space and c the number of classes. The between-class scatter matrix
S_b and the within-class scatter matrix S_w are defined as follows:

$$S_b = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T = \Phi_b \Phi_b^T, \tag{1}$$

$$S_w = \sum_{i=1}^{c} \sum_{k \in C_i} (x_k - m_i)(x_k - m_i)^T = \Phi_w \Phi_w^T, \tag{2}$$

where N_i is the number of samples in class C_i (i = 1, 2, ..., c), N is the number
of all samples, m_i is the mean of the samples in class C_i, and m is the overall
mean of all samples. The total scatter matrix, i.e. the covariance matrix of all the
samples, is defined as:

$$S_t = S_b + S_w = \sum_{i=1}^{N} (x_i - m)(x_i - m)^T = \Phi_t \Phi_t^T. \tag{3}$$

LDA tries to find an optimal projection W = [w_1, w_2, w_3, ..., w_{c-1}] which
satisfies

$$J(W) = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|}, \tag{4}$$

which is just the Fisher criterion function.
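
The definitions (1)-(3) can be checked numerically. The following is a minimal NumPy sketch, not taken from the paper: the function name `scatter_matrices` and the row-per-sample data layout are assumptions made for illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_b, S_w, S_t) for samples X of shape (N, n) and labels y of shape (N,)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[1]
    m = X.mean(axis=0)                      # overall mean m
    S_b = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for label in np.unique(y):
        X_i = X[y == label]                 # samples of class C_i
        m_i = X_i.mean(axis=0)              # class mean m_i
        diff = (m_i - m)[:, None]
        S_b += len(X_i) * (diff @ diff.T)   # N_i (m_i - m)(m_i - m)^T, Eq. (1)
        centered = X_i - m_i
        S_w += centered.T @ centered        # sum of (x_k - m_i)(x_k - m_i)^T, Eq. (2)
    centered_all = X - m
    S_t = centered_all.T @ centered_all     # Eq. (3), computed directly from the data
    return S_b, S_w, S_t
```

Computing S_t directly from the centered data, rather than as S_b + S_w, makes the identity S_t = S_b + S_w something that can be verified instead of assumed.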





2.1 Standard LDA and Direct LDA 

As is well known, W can be constructed from the eigenvectors of S_w^{-1} S_b. But
this method is numerically unstable because it involves the direct inversion of a
likely high-dimensional matrix. The most frequently used LDA algorithm in practice
is based on simultaneous diagonalization. The basic idea of the algorithm is to find
a matrix W that can simultaneously diagonalize both S_w and S_b, i.e.,

$$W^T S_w W = I, \quad W^T S_b W = \Lambda. \tag{5}$$

Most algorithms require that S_w be non-singular, because they diagonalize S_w
first. The above procedure breaks down when S_w becomes singular, which surely
happens when the number of training samples is smaller than the dimension of the
sample vector, i.e. in the small sample size problem (SSSP). This singularity exists
for most face recognition tasks.
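
Simultaneous diagonalization as in (5) can be sketched in a few lines of NumPy. This is an illustrative sketch under the assumption that S_w is positive definite; the function name `simultaneous_diagonalize` and the tolerance `tol` are hypothetical. It whitens S_w first and then diagonalizes the whitened S_b, and it raises an error in the singular case discussed above.

```python
import numpy as np

def simultaneous_diagonalize(S_w, S_b, tol=1e-12):
    """Return (W, lam) with W^T S_w W = I and W^T S_b W = diag(lam)."""
    d, E = np.linalg.eigh(S_w)              # S_w = E diag(d) E^T
    if d.max() <= 0 or d.min() <= tol * d.max():
        # This is the small sample size problem: S_w cannot be whitened.
        raise ValueError("S_w is singular; whitening is not possible")
    Z = E / np.sqrt(d)                      # scale column j by d_j^{-1/2}: Z^T S_w Z = I
    lam, U = np.linalg.eigh(Z.T @ S_b @ Z)  # diagonalize S_b in the whitened space
    order = np.argsort(lam)[::-1]           # sort eigenvalues in decreasing order
    return Z @ U[:, order], lam[order]
```

Because U is orthogonal, the final W = Z U keeps W^T S_w W = I while making W^T S_b W diagonal.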

One available solution to this problem is to perform PCA to project the
n-dimensional image space onto a lower-dimensional subspace. The PCA step
essentially removes the null space from both S_w and S_b. Therefore, this step
potentially loses useful information.

In fact, the null space of S_w contains the most discriminative information,
especially when the projection of S_b is not zero in that direction. The Direct LDA
(DLDA) algorithm [4] was proposed to keep the null space of S_w.

DLDA first removes the null space of S_b by performing eigen-analysis on S_b; then
a simultaneous diagonalization procedure is used to seek the optimal discriminant
vectors in the subspace of S_b, i.e.

$$W^T S_b W = I, \quad W^T S_w W = D_w. \tag{6}$$

Because the rank of S_b is in most cases smaller than that of S_w, removing the null
space of S_b may lose part of or the entire null space of S_w, which is very likely
to be full-rank after the removal. So DLDA does not make full use of the null
space.



2.2 Null Space-Based LDA 



From Fisher's criterion, i.e. objective function (4), we find the following. In
standard LDA, W is sought such that (5) holds, so the form of the optimal solution
provided by standard LDA is

$$\mathrm{optimum}_{LDA} = \max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} = |\Lambda| = \mathrm{opt\,max}/1. \tag{7}$$

In DLDA, W is sought such that (6) holds, so the form of the optimal solution
provided by DLDA is

$$\mathrm{optimum}_{DLDA} = \max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} = 1/|D_w| = 1/\mathrm{opt\,min}. \tag{8}$$

Compared with the above LDA approaches, a more reasonable method (Chen [5]), which
we call null space-based LDA, has been presented. In Chen's theory, null space-based
LDA should reach:

$$\mathrm{optimum}_{Null} = \max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} = \mathrm{opt\,max}/0. \tag{9}$$

That means the optimal projection W should satisfy

$$W^T S_w W = 0, \quad W^T S_b W = \Lambda, \tag{10}$$

i.e. the optimal discriminant vectors must exist in the null space of S_w.

In terms of this criterion, we can conclude that null space-based LDA generally
outperforms LDA (Fisherface) or DLDA, since

$$\mathrm{optimum}_{Null} = \infty > \mathrm{optimum}_{DLDA} > \mathrm{optimum}_{LDA}. \tag{11}$$

The computational complexity of extracting the null space of S_w is very high
because of the high dimension of S_w, so in [5] a pixel grouping operation is used
in advance to extract geometric features and to reduce the dimension of the
samples. However, the pixel grouping preprocessing is arbitrary and may cause a
loss of useful facial features.



3 Our Null Space Method (NLDA) 

In this section, the essence of null space-based LDA in the SSSP is revealed by
theoretical justification, and the most suitable situation for null space methods
is identified. Next, we propose the NLDA algorithm, which is conceptually simple
yet powerful in performance.



3.1 Most Suitable Situation 

For the small sample size problem (SSSP), in which n > N, the dimension of the null
space of S_w is very large, and not all of the null space contributes to the
discriminative power. Since both S_b and S_w are symmetric and positive
semi-definite, we can prove, as mentioned in [6], that

$$N(S_t) = N(S_b) \cap N(S_w). \tag{12}$$

From the statistical perspective, the null space of S_b makes no contribution to
discriminative ability. Therefore, the useful subspace of the null space of S_w is

$$\widetilde{N}(S_w) = N(S_w) - N(S_t) = N(S_w) \cap \overline{N(S_t)}. \tag{13}$$

The necessary and sufficient condition for null space methods to work is

$$\widetilde{N}(S_w) \neq \emptyset \Rightarrow N(S_w) \supset N(S_t) \Rightarrow \dim N(S_w) > \dim N(S_t) \Rightarrow \mathrm{rank}(S_t) > \mathrm{rank}(S_w). \tag{14}$$

In many cases,

$$\mathrm{rank}(S_t) = \min\{n, N-1\}, \quad \mathrm{rank}(S_w) = \min\{n, N-c\}, \tag{15}$$

and the dimension of the discriminative null space of S_w can be evaluated from (12):

$$\dim \widetilde{N}(S_w) = \mathrm{rank}(S_t) - \mathrm{rank}(S_w). \tag{16}$$

If n <= N - c, then by (15) rank(S_t) = rank(S_w) = n, so the necessary condition
(14) is not satisfied and we cannot extract any null space. That means no null
space-based method works in the large sample size case.

If N - c < n < N - 1, then rank(S_t) = n > rank(S_w) = N - c, and the dimension of
the effective null space can be evaluated from (16): dim = n - N + c < c - 1. Hence
the number of discriminant vectors is less than c - 1, and some discriminatory
information may be lost.

Only when n >= N - 1 (SSSP), for which rank(S_t) = N - 1 > rank(S_w) = N - c, do we
obtain dim = c - 1. The dimension of the extracted null space is then exactly c - 1,
which coincides with the number of ideal features for classification. Therefore, we
can conclude that null space methods are always applicable to any small sample size
problem.

In particular, when n is equal to N - 1, S_t is full-rank and N(S_t) is null. By
(13), the discriminative null space equals N(S_w), so all of the null space of S_w
contributes to the discriminative power. Hence we conclude that the most suitable
situation for null space-based methods is

$$n = N - 1. \tag{17}$$
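
The three cases above follow mechanically from (15) and (16). As a small worked illustration (the function name `discriminative_null_dim` is hypothetical, and the generic ranks in (15) are assumed to hold), the dimension of the discriminative null space can be computed as:

```python
def discriminative_null_dim(n, N, c):
    """Dimension of the useful null space of S_w, from Eqs. (15) and (16).

    n: sample dimension, N: number of training samples, c: number of classes.
    Assumes the generic ranks rank(S_t) = min(n, N-1) and rank(S_w) = min(n, N-c).
    """
    rank_t = min(n, N - 1)
    rank_w = min(n, N - c)
    return rank_t - rank_w

# Example values for N = 200 samples in c = 40 classes:
large_sample = discriminative_null_dim(n=100, N=200, c=40)    # n <= N - c
in_between = discriminative_null_dim(n=180, N=200, c=40)      # N - c < n < N - 1
small_sample = discriminative_null_dim(n=10304, N=200, c=40)  # n >= N - 1, 92x112 images
most_suitable = discriminative_null_dim(n=199, N=200, c=40)   # n = N - 1
```

With these example values the large sample case yields no usable null space, the intermediate case yields fewer than c - 1 dimensions, and both n >= N - 1 cases yield exactly c - 1 = 39.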



3.2 NLDA 



Combining (12)-(16), we develop our null space method.

Algorithm I:

1. Remove the null space of S_t.

Perform PCA to project the n-dimensional image space onto a low-dimensional
subspace, i.e. perform eigen-analysis on S_t; the dimension of the extracted
subspace is usually N - 1. The projection P, whose columns are all the eigenvectors
of S_t corresponding to the nonzero eigenvalues, is calculated first, and then the
within-class scatter and between-class scatter in the resulting subspace are
obtained:

$$P^T S_t P = D_t, \quad P^T S_w P = S'_w, \quad P^T S_b P = S'_b.$$

2. Extract the null space of S'_w.

Diagonalizing S'_w, we have

$$V^T S'_w V = D_w,$$

where V^T V = I and D_w is a diagonal matrix sorted in increasing order. Discard the
eigenvectors whose eigenvalues are sufficiently far from 0, keeping c - 1
eigenvectors of S'_w in most cases. Let Y be the first c - 1 columns of V, which
span the null space of S'_w; we have

$$Y^T S'_w Y = 0, \quad Y^T S'_b Y = S''_b.$$

3. Diagonalize S''_b (usually a (c-1) x (c-1) matrix), which is full-rank.

Perform eigen-analysis:

$$U^T S''_b U = \Lambda,$$

where U^T U = I and \Lambda is a diagonal matrix sorted in decreasing order.

The final projection matrix is:

$$W = P Y U.$$

W is usually an n x (c-1) matrix, which diagonalizes both the numerator and the
denominator of Fisher's criterion to (c-1) x (c-1) matrices as in (10), and in
particular leads to a zero matrix in the denominator.

It is notable that the third step of Huang's algorithm [6] is used to remove the
null space of S''_b. In fact, we are able to prove that S''_b is already full-rank
after the previous two steps.

Lemma. S''_b is full-rank, where S''_b is defined in step 2 of Algorithm I.

Proof:

From steps 1 and 2, we derive that

$$S''_b = Y^T S'_b Y = Y^T S'_b Y + Y^T S'_w Y = Y^T (S'_b + S'_w) Y = Y^T P^T (S_b + S_w) P Y = Y^T P^T S_t P Y = Y^T D_t Y.$$

For any vector a whose dimension is equal to that of S''_b,

$$a^T S''_b a = a^T Y^T D_t Y a = (D_t^{1/2} Y a)^T (D_t^{1/2} Y a) \ge 0,$$

so S''_b is positive semi-definite. Suppose there exists a such that
a^T S''_b a = 0; then D_t^{1/2} Y a = 0. By step 1, we know D_t is full-rank, thus
Y a = 0. And by step 2, Y has full column rank, since it spans the extracted null
space. Hence a^T S''_b a = 0 iff a = 0. Therefore S''_b is a positive definite
matrix, which is of course full-rank. □

The third step is optional. Although it maximizes the between-class scatter in the
null subspace, which appears to achieve the best discriminative ability, it may
incur overfitting. Because projecting all samples onto the null space of S_w is
powerful enough in its clustering ability to achieve good generalization
performance, step 3 of Algorithm I should be eliminated in order to avoid possible
overfitting.

NLDA algorithm:

1. Remove the null space of S_t, i.e.

$$P^T S_t P = D_t, \quad P^T S_w P = S'_w,$$

where P is usually n x (N-1).

2. Extract the null space of S'_w, i.e.

$$Y^T S'_w Y = 0,$$

where Y spans the null space and is usually (N-1) x (c-1).

The final NLDA projection matrix is:

$$W = P Y.$$

PY is the discriminative subspace of the whole null space of S_w and is really
useful for discrimination. The number of optimal discriminant vectors is usually
c - 1, which coincides with the number of ideal discriminant vectors [1]. Therefore,
removing step 3 is a feasible strategy against overfitting.

Under situation (17), S_t is full-rank and step 1 of the NLDA algorithm is skipped.
The NLDA projection can be extracted by performing eigen-analysis on S_w directly.
The procedure of NLDA under this situation is the most straightforward and requires
only one eigen-analysis. NLDA therefore saves much computational cost in the most
suitable situation to which it is applicable.
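
The two NLDA steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `nlda_fit`, the row-per-sample layout, and the relative tolerance `tol` are hypothetical, and step 1 uses an SVD of the centered data in place of an explicit eigen-analysis of S_t (the right singular vectors with nonzero singular values are the eigenvectors of S_t with nonzero eigenvalues).

```python
import numpy as np

def nlda_fit(X, y, tol=1e-8):
    """Sketch of NLDA. X: (N, n) samples as rows, y: (N,) labels.

    Returns (W, m): W is the n x (c-1) projection P Y, m is the overall mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    m = X.mean(axis=0)
    Xc = X - m

    # Step 1: remove the null space of S_t = Xc^T Xc.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[s > tol * s.max()].T            # n x r, with r usually N - 1

    # Within-class scatter S'_w in the reduced subspace.
    Z = Xc @ P
    r = P.shape[1]
    S_w = np.zeros((r, r))
    for label in np.unique(y):
        Z_i = Z[y == label]
        D = Z_i - Z_i.mean(axis=0)
        S_w += D.T @ D

    # Step 2: extract the null space of S'_w (eigenvalues numerically zero).
    evals, V = np.linalg.eigh(S_w)         # eigenvalues in increasing order
    Y = V[:, evals <= tol * evals.max()]   # usually r x (c - 1)

    return P @ Y, m
```

A sample x is then mapped to `(x - m) @ W`. On the training set every sample of a class collapses to the same point, which is the zero within-class scatter required by (10).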



4 Null Space-Based Kernel Fisher Discriminant Analysis 

The key idea of Kernel Fisher Discriminant Analysis (KFDA) [8, 9, 10] is to solve
the problem of LDA in an implicit feature space F, which is constructed by the
kernel trick:

$$\phi: x \in \mathbb{R}^n \rightarrow \phi(x) \in F. \tag{18}$$

The important feature of kernel techniques is that the implicit feature vector
\phi(x) need not be computed explicitly; only the inner product of any two vectors
in F needs to be computed, by means of a kernel function.

In this section, we present a novel method (NKFDA) in which the kernel technique is
incorporated into discriminant analysis in the null space.



4.1 Kernel Fisher Discriminant Analysis (KFDA) 

The between-class scatter S_b and the within-class scatter S_w in F are computed as
in (1) and (2), except that we replace x_i by \phi(x_i) as samples in F. Consider
performing LDA in the implicit feature space F; this amounts to maximizing the
Fisher criterion function (4).

Because any solution w in F must lie in the span of all the samples in F, there
exist coefficients \alpha_i, i = 1, 2, ..., N, such that

$$w = \sum_{i=1}^{N} \alpha_i \phi_i, \tag{19}$$

where \phi_i = \phi(x_i). Substituting w into (4), the solution of (4) can be
obtained by solving a new Fisher problem:

$$J(\alpha) = \arg\max_{\alpha} \frac{|\alpha^T K_b \alpha|}{|\alpha^T K_w \alpha|}, \tag{20}$$

where K_b and K_w (Liu [8]) are based on the new samples:

$$\xi_i = (k(x_1, x_i), k(x_2, x_i), \ldots, k(x_N, x_i))^T, \quad 1 \le i \le N. \tag{21}$$

As for the kernel function, Liu [13] proposed a novel kernel function called the
Cosine kernel, which is based on the original polynomial kernel and has been
demonstrated to improve the performance of KFDA. It is defined as follows:

$$k(x, y) = (\phi(x) \cdot \phi(y)) = (a (x \cdot y) + b)^d, \tag{22}$$

$$\tilde{k}(x, y) = \frac{k(x, y)}{\sqrt{k(x, x)\, k(y, y)}}. \tag{23}$$

In our experiments, the Cosine kernel (a = 10^3/sizeof(image), b = 0, d = 2) is
adopted and shows good performance in face recognition. The cosine measure should be
more reliable than the inner product measure because it gives a better similarity
representation in the implicit feature space.
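
Equations (22) and (23) translate directly into code. The sketch below is illustrative only: the function names and the default parameter values (a = 1, b = 0, d = 2) are assumptions, and samples are stored as rows.

```python
import numpy as np

def polynomial_kernel(A, B, a=1.0, b=0.0, d=2):
    """Eq. (22): k(x, y) = (a (x . y) + b)^d for all rows x of A and y of B."""
    return (a * (A @ B.T) + b) ** d

def cosine_kernel(A, B, a=1.0, b=0.0, d=2):
    """Eq. (23): polynomial kernel normalized by sqrt(k(x, x) k(y, y))."""
    k_ab = polynomial_kernel(A, B, a, b, d)
    k_aa = (a * np.einsum("ij,ij->i", A, A) + b) ** d   # k(x, x) for each row of A
    k_bb = (a * np.einsum("ij,ij->i", B, B) + b) ** d   # k(y, y) for each row of B
    return k_ab / np.sqrt(np.outer(k_aa, k_bb))
```

With b = 0 the scale factor a cancels in (23), and the Cosine kernel reduces to the d-th power of the ordinary cosine of the angle between x and y, so every sample has unit self-similarity.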



4.2 NKFDA 

Here we define a kernel sample set (corresponding to the kernel space in N
dimensions) $\{\xi_i\}_{1 \le i \le N}$. The optimal solution of (4) is equivalent
to that of (20), so the original problem can be entirely converted to the problem of
LDA on the kernel sample set.

From Section 3, we know that NLDA saves much computational cost in the most
suitable situation, where the null space projection can be extracted from the
within-class scatter directly. Our objective is to transform the dimension of all
the samples from n to N - 1 through the kernel mapping, so that NLDA can work in the
most suitable situation. Any method that can transform raw samples to
(N-1)-dimensional data without adding or losing essential information can exploit
the merit of NLDA.



In (19), all the training samples in F, $\{\phi_i\}_{1 \le i \le N}$, are used to
represent w. Define the kernel matrix M,

$$M = (k(x_i, x_j))_{1 \le i,j \le N} = (k_{i,j})_{1 \le i,j \le N}, \tag{24}$$

and assume $\Phi = (\phi_1, \phi_2, \ldots, \phi_N)$; then $M = \Phi^T \Phi$. In
mathematics,

$$\mathrm{rank}(\Phi) = \mathrm{rank}(M). \tag{25}$$

Because rank(M) < N holds, and especially when the training data set is very large
it follows that rank(M) << N [11][12], we conclude that

$$\mathrm{rank}(\Phi) \ll N. \tag{26}$$

Due to (26), we may assume without loss of generality that \phi_N is not a basis
vector of $\{\phi_i\}_{1 \le i \le N}$, and consequently we can rewrite (19) as
follows:

$$w = \sum_{i=1}^{N-1} \alpha_i \phi_i. \tag{27}$$

Subsequently, K_b and K_w are recomputed, based on

$$\xi_i = (k(x_1, x_i), k(x_2, x_i), \ldots, k(x_{N-1}, x_i))^T. \tag{28}$$

Now the dimension of our defined kernel space is N - 1. Our objective is to enable
NLDA to work on the (N-1)-dimensional kernel sample set.



Input: 1) training samples $\{x_i\}_{1 \le i \le N}$ and label set $\{C_j\}_{1 \le j \le c}$

2) the kernel function and its parameters: k(x, y)

Algorithm:

1. For i = 1, 2, ..., N

do kernel mapping on each training sample:

$$K(x_i) = (k(x_1, x_i), k(x_2, x_i), \ldots, k(x_{N-1}, x_i))^T.$$

For a new sample x, the corresponding point in the kernel space is

$$K(x) = (k(x_1, x), k(x_2, x), \ldots, k(x_{N-1}, x))^T.$$

2. Calculate class mean and within-class scatter:

$$m_j = \sum_{i \in C_j} K(x_i) / N_j,$$

$$K_w = \sum_{j=1}^{c} \sum_{i \in C_j} (K(x_i) - m_j)(K(x_i) - m_j)^T.$$

3. Extract the null space Y of K_w ((N-1) x (N-1)), such that
$Y^T K_w Y = 0$; Y is usually (N-1) x (c-1).

Output: The resulting mapping on the raw sample set:

$$\Psi: x \rightarrow \Psi(x) = (Y^T K)(x) = Y^T \cdot K(x).$$



Fig. 1. NKFDA algorithm
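
The steps of Fig. 1 can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `nkfda_fit`, `nkfda_transform`, and `cosine_kernel`, the tolerance `tol`, and the choice of simply dropping the last training sample as the non-basis vector are all assumptions. The kernel is the Cosine kernel of (22)-(23) with b = 0.

```python
import numpy as np

def cosine_kernel(A, B, a=1.0, d=2):
    """Cosine kernel of Eqs. (22)-(23) with b = 0, for rows of A against rows of B."""
    k_ab = (a * (A @ B.T)) ** d
    k_aa = (a * np.einsum("ij,ij->i", A, A)) ** d
    k_bb = (a * np.einsum("ij,ij->i", B, B)) ** d
    return k_ab / np.sqrt(np.outer(k_aa, k_bb))

def nkfda_fit(X, y, kernel=cosine_kernel, tol=1e-8):
    """Sketch of Fig. 1. X: (N, n) samples as rows, y: (N,) labels.

    Returns (anchors, Y): the N-1 samples defining the kernel space, and the
    null space basis Y of the within-class scatter K_w.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    anchors = X[:-1]                       # x_1, ..., x_{N-1}
    K = kernel(X, anchors)                 # step 1: row i is K(x_i)^T, shape (N, N-1)
    K_w = np.zeros((K.shape[1], K.shape[1]))
    for label in np.unique(y):             # step 2: class means and K_w
        K_j = K[y == label]
        D = K_j - K_j.mean(axis=0)
        K_w += D.T @ D
    evals, V = np.linalg.eigh(K_w)         # step 3: single eigen-analysis
    Y = V[:, evals <= tol * evals.max()]   # usually (N-1) x (c-1)
    return anchors, Y

def nkfda_transform(X_new, anchors, Y, kernel=cosine_kernel):
    """Output mapping: each row of the result is (Y^T K(x))^T."""
    return kernel(np.asarray(X_new, dtype=float), anchors) @ Y
```

The same `nkfda_transform` call serves prototypes and queries alike, and the only eigen-analysis is the one on K_w.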



As shown in Fig. 1, the NKFDA algorithm outputs the mapping \Psi, which is a
nonlinear dimensionality reduction mapping (from n dimensions to c - 1). For any
sample (whether it is a prototype or a query), \Psi provides a universal mapping
that transforms the raw sample point into a lower-dimensional space. Such a
technique can be applied with a reasonable implementation of generalization.

Note that NKFDA, like KFDA, cannot deal with the case in which only one sample per
person is available for training.

For the large sample size problem (n << N), S_w is full-rank, so we cannot extract
any null space. That means no null space-based method works in the large sample size
case. However, after the kernel mapping, NLDA can work on the kernel sample set.
Hence the kernel mapping extends the ability of null space approaches to the large
sample size problem.



5 Experiments 

To demonstrate the efficiency of our method, extensive experiments were conducted
on the ORL face database, the FERET database and a mixture database. All LDA methods
were compared on the same training sets and testing sets, including Fisherface
proposed in [1, 2, 3], Direct LDA proposed in [4], and our methods, NLDA and NKFDA.



5.1 ORL Database 

The ORL face database is composed of 40 distinct subjects with 10 different images
each. All the subjects are in an upright, frontal position. The size of each face
image is 92x112. The first row of Fig. 2 shows 6 images of the same subject.

We report the recognition rates for different numbers of training samples. The
number of training samples per subject, k, increases from 2 to 9. In each round, k
images per subject are randomly selected from the database for training and the
remaining images of the same subject are used for testing. For each k, 20 tests were
performed and the results were averaged. Table 1 shows the average recognition rates
(%). Without any pre-processing, we choose 39 (i.e. c - 1) as the final dimension.
Our methods, NLDA and NKFDA, show encouraging performance.
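
The evaluation protocol described above (k random training images per subject, the rest for testing, averaged over 20 rounds) can be sketched as follows. The helper names `random_split` and `average_rate`, the fixed seed, and the `evaluate` callback that returns a recognition rate are hypothetical; the paper does not specify how the random selection was implemented.

```python
import random

def random_split(labels, k, rng):
    """Pick k random sample indices per subject for training; the rest are for testing."""
    by_subject = {}
    for index, subject in enumerate(labels):
        by_subject.setdefault(subject, []).append(index)
    train, test = [], []
    for indices in by_subject.values():
        chosen = set(rng.sample(indices, k))
        train += [i for i in indices if i in chosen]
        test += [i for i in indices if i not in chosen]
    return train, test

def average_rate(labels, k, evaluate, rounds=20, seed=0):
    """Average evaluate(train_indices, test_indices) over several random splits."""
    rng = random.Random(seed)
    rates = [evaluate(*random_split(labels, k, rng)) for _ in range(rounds)]
    return sum(rates) / len(rates)
```

Here `evaluate` would train a method such as NLDA on the training indices and return its recognition rate on the test indices; using the same seed for every method keeps the comparison on identical splits.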



Table 1. Recognition rates (%) on the ORL database

k     LDA      DLDA     NLDA     NKFDA
2     76.65    80.10    85.47    82.89
3     87.09    87.54    90.91    89.13
4     91.68    91.50    93.86    93.15
5     93.17    94.65    95.45    95.13
6     95.79    96.50    97.13    96.72
7     96.85    97.12    97.54    97.21
8     98.25    99.15    98.95    98.95
9     99.00    99.95    99.15    99.38



5.2 FERET Database 

We also tested our method on a more complex and challenging dataset, the FERET
database. We selected 70 subjects from the FERET database [7], with 6 upright,
frontal-view images of each subject. The face images involve much more variation in
lighting, facial expression and facial details. The second row of Fig. 2 shows one
subject from the selected data set.

The eye locations are fixed by geometric normalization. The size of the face images
is normalized to 92x112, and 69 (i.e. c - 1) features are extracted. The training
and test processes are similar to those on the ORL database, and similar comparisons
between the methods are performed. This time k ranges from 2 to 5, and the
corresponding average recognition rates (%) are shown in Table 2.



Table 2. Recognition rates (%) on the FERET database

k     LDA      DLDA     NLDA     NKFDA
2     56.04    63.25    75.20    72.21
3     76.95    76.71    85.64    83.60
4     87.23    88.30    92.79    93.85
5     94.80    94.71    97.34    98.29



5.3 Mixture Database 

To test NLDA and NKFDA on large datasets, we constructed a mixture database of 125
persons and 985 images, which is a collection of three databases: (a) the ORL
database (10x40); (b) the YALE database (11x15; the third row of Fig. 2 shows one
subject); (c) the FERET subset (6x70). All the images are resized to 92x112.
There are facial expression, illumination and pose variations.




Fig. 2. Samples from the mixture database 

The mixture database is divided into two non-overlapping sets for training and
testing. The training set consists of 500 images: 5 images, 6 images and 3 images
per person are randomly selected from the ORL database, the YALE database and the
FERET subset, respectively. The remaining 485 images are used for testing. In order
to reduce the influence of extreme illumination, histogram equalization is applied
to the images as pre-processing. We compare the proposed methods with Fisherface and
DLDA, and the experimental results are shown in Fig. 3. It can be seen that NKFDA
largely outperforms the other three methods when over 100 features are used, and a
recognition rate of 91.65% is achieved at the feature dimension of 124 (i.e. c - 1).



5.4 Discussion

From the above three experiments, we find that NKFDA is better than NLDA for a
large number of training samples (e.g. more than 300), worse than NLDA for a small
training sample size (e.g. fewer than 200), and superior to DLDA in most situations.
Consequently, NKFDA is more effective for larger sample sizes, since the greater the
sample size, the more accurately kernels can describe the nonlinear relations among
samples.

As to computational cost, the most time-consuming procedure, eigen-analysis, is
performed on three matrices (one of N x N and two of (N-c) x (N-c)) in the
Fisherface method, on two matrices (c x c and (c-1) x (c-1)) in DLDA, on two
matrices (N x N and (N-1) x (N-1)) in NLDA, and on one matrix ((N-1) x (N-1)) in
NKFDA. NKFDA performs only one eigen-analysis, achieving both efficiency and good
performance.




Fig. 3. Comparison of four methods 



6 Conclusion 

In this paper, we present two new subspace methods (NLDA, NKFDA) based on the 
null space approach and the kernel technique. Both of them effectively solve the 
small sample size problem and eliminate the possibility of losing discriminative in- 
formation. 

The main contributions of this paper are summarized as follows: (a) the essence of
null space-based LDA in the SSSP is revealed by theoretical justification, and the
most suitable situation for null space methods is identified; (b) the NLDA algorithm
is proposed, which is simpler than all other null space methods and reduces the
computational cost while maintaining performance; (c) a more efficient Cosine kernel
function is adopted to enhance the capability of the original polynomial kernel;
(d) the NKFDA algorithm is presented, which performs only one eigen-analysis and is
more stable in numerical computation; (e) NKFDA is also applicable to the large
sample size problem, and is superior to NLDA when the sample size is very large.



Abstract. Alignment between the input and target objects has great
impact on the performance of image analysis and recognition systems,
such as those for medical imaging and face recognition. Active Shape
Models (ASM) [1] and Active Appearance Models (AAM) [2, 3] provide an
important framework for this task. However, an effective method for the
evaluation of ASM/AAM alignment results has been lacking. Without
an alignment quality evaluation mechanism, a bad alignment cannot be
identified, and this can degrade system performance.

In this paper, we propose a statistical learning approach for constructing
an evaluation function for face alignment. A nonlinear classification
function is learned from a set of positive (good alignment) and negative (bad
alignment) training examples to effectively distinguish between qualified
and un-qualified alignment results. The AdaBoost learning algorithm is
used, where weak classifiers are constructed based on edge features and
combined into a strong classifier. Several strong classifiers are learned in
stages using bootstrap samples during training, and are then used in
cascade at test time. Experimental results demonstrate that the
classification function learned using the proposed approach provides
semantically more meaningful scoring than the reconstruction error used in
AAM for classification between qualified and un-qualified face alignment.



1 Introduction 

Many image analysis and recognition applications require alignment between an
object in the input image and a target object. Alignment can have a great impact
on system performance. For example, in appearance-based face recognition,
alignment provides a more sensible foundation for template matching based
recognition; the use of bad alignment can degrade system performance significantly.

Active Shape Models (ASM) [1] and Active Appearance Models (AAM) [2, 3]
have been used as alignment algorithms in medical image analysis and face
recognition [4]. However, an effective method for the evaluation of ASM/AAM
alignment results has been lacking: there has been no convergence criterion for
ASM. As such, the ASM search can give a bad result without giving the user a
warning. In the AAM, the PCA (Principal Component Analysis) reconstruction
error is used as a distance measure for the evaluation of alignment quality (and
for guiding the search as well). However, the reconstruction error may not be
a good discriminant for the evaluation of alignment quality, because a non-face
can look like a face when projected onto the PCA face subspace.

In this paper, we propose a statistical learning approach for constructing 
an effective evaluation function for face alignment. A nonlinear classification 
function is learned from a training set of positive and negative training examples 
to effectively distinguish between qualified and un-qualified alignment results. 
The positive subset consists of qualified face alignment examples and the negative 
subset consists of obviously un-qualified and near-but-not-qualified examples. 

We use the AdaBoost algorithm [5, 6] for the learning. A set of candidate weak
classifiers is created based on edge features extracted using Sobel-like
operators. We choose to use edge features because crucial cues for alignment
quality are around edges. Experimentally, we also found that the Sobel features
produced significantly better results than other features such as Haar wavelets.
The AdaBoost learning selects or learns a sequence of best features and the
corresponding weak classifiers, and combines them into a strong classifier.

In the training stage, several strong classifiers are learned in stages using
bootstrap training samples, and in the test they are cascaded to form a stronger
classifier, following an idea in boosting-based face detection [7]. Such a
divide-and-conquer strategy makes the training easier and the good-bad
classification more effective. The evaluation function thus learned gives a
quantitative confidence, and the good-bad classification is achieved by comparing
the confidence with a learned optimal threshold.

There are two important distinctions between an evaluation function thus 
learned and the linear evaluation function of reconstruction error used in AAM. 
First, the evaluation is learned in such a way to distinguish between good and bad 
alignment. Secondly, the scoring is nonlinear, which provides a semantically more 
meaningful classification between good and bad alignment. Experimental results 
demonstrate that the classification function learned using the proposed approach 
provides semantically meaningful scoring for classification between qualified and 
un-qualified face alignment. 

The rest of the paper is organized as follows: Section 2 briefly describes 
the ASM method and the problem of alignment quality evaluation. AdaBoost 
based learning is presented in Section 3. Section 4 provides the construction of 
candidate weak classifiers. Section 5 proposes the learning of weak classifiers. 
Section 6 provides experimental results. Section 7 draws a conclusion. 



2 ASM/AAM and Solution Quality Evaluation

Let us briefly describe the ASM and AAM methods before discussing the issue of
alignment evaluation. The standard ASM consists of two statistical models:
(1) a global shape model, which is derived from the landmarks on the object
contour; (2) local appearance models, which are derived from the profiles
perpendicular to the object contour around each landmark. ASM uses the local
models to find the candidate shape and the global model to constrain the searched
shape.

AAM makes use of PCA techniques to model both shape variation and texture
variation, and uses the correlations between the shape subspace and the texture
subspace to model the face. In searching for a solution, it assumes linear
relationships between appearance variation and texture variation, and between
texture variation and position variation, and learns the two linear regression
models from training data. These two models reduce the minimization in the
high-dimensional space. This strategy is also developed in the active blob model
by Sclaroff and Isidoro [8].

The training data for ASM consists of shape only, while that for AAM consists of
both shape and texture. Denote a shape
$S_0 = ((x_1, y_1), \ldots, (x_K, y_K)) \in \mathbb{R}^{2K}$ by a sequence of K
points in the 2D image plane, and a texture $T_0$ by the patch of pixel intensities
enclosed by $S_0$. Let $\bar{S}$ be the mean shape of all the training shapes, as
illustrated in Fig. 1(a). Fig. 1(b) and (c) show two examples of shapes overlaid on
the faces. In AAM, all the shapes are aligned or warped to the tangent space of the
mean shape $\bar{S}$. After that, the texture $T_0$ is warped correspondingly to
$T \in \mathbb{R}^{L}$, where L is the number of pixels in the mean shape $\bar{S}$.
The warping may be done by pixel value interpolation, e.g. using a triangulation
or thin plate spline method.




(a) (b) (c) 

Fig. 1. (a) The mesh of the mean shape, (b) & (c): Two face instances labelled 
with 83 landmarks. 



There has been no convergence criterion or quality evaluation for ASM search. In
ASM search, the mean shape is placed near the center of the detected image and a
coarse-to-fine search is performed. Large movements are made in the first few
iterations, getting the position roughly right. As the search progresses, more
subtle adjustments are made. The result can give a good match to the target
image or it can fail (see Fig. 2). The failure can happen even if the starting
position is near the target. When the variations of expression and illumination
are large, ASM search can diverge in order to match the local image pattern.

In AAM search, the PCA reconstruction error is used to guide the search
and also serves as the convergence and evaluation criterion. This error function
is defined as the distance between the image patch (intended to contain the face
region only) after warping to the mean shape and the projection of the patch
onto the PCA subspace of face texture. However, the reconstruction error may
not be a good measure for the evaluation of alignment quality because a non-face
can look like a face when projected onto the PCA face subspace. Cootes pointed
out that, of 2700 testing examples, 519 failed to converge to a satisfactory result
(the mean point position error is greater than 7.5 pixels per point) [4].

Fig. 2. Four face instances of qualified (top) and un-qualified (bottom) examples
with their warped images.
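The reconstruction error described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the AAM implementation: the function names `fit_pca` and `reconstruction_error` are hypothetical, and the patches are assumed to be already warped to the mean shape and flattened into vectors.

```python
import numpy as np

def fit_pca(textures, variance_kept=0.99):
    """Fit a PCA texture subspace; rows of `textures` are warped face patches."""
    mean = textures.mean(axis=0)
    centered = textures - mean
    # SVD of the centered data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of components whose cumulative energy reaches the target.
    n_components = int(np.searchsorted(energy, variance_kept) + 1)
    return mean, vt[:n_components]

def reconstruction_error(patch, mean, basis):
    """Distance between a patch and its projection onto the PCA subspace."""
    centered = patch - mean
    coeffs = basis @ centered        # project onto the subspace
    projected = basis.T @ coeffs     # map back to pixel space
    return float(np.linalg.norm(centered - projected))
```

A patch that lies inside the learned subspace gets an error near zero whether or not it is a well-aligned face, which is the weakness the text points out.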

In the following, we present a learning-based approach for constructing an
evaluation function for ASM/AAM based alignment.

3 AdaBoost Based Learning 

Our objective is to learn an evaluation function from a training set of qualified
and un-qualified alignment examples. From now on, we use the terms positive and
negative examples for the two classes of data. These examples are the face images after
warping to the mean shape, as shown in Fig. 2. Face alignment quality evaluation
can be posed as a two-class classification problem: given an alignment result $x$
(i.e. a warped face), the evaluation function gives $H(x) = +1$ if $x$ is a positive example,
or $-1$ otherwise. We want to learn an $H(x)$ that can provide a score in
$[-1, +1]$ with a threshold around 0 for the binary classification.

For two-class problems, a set of $N$ labelled training examples is given as
$(x_1, y_1), \ldots, (x_N, y_N)$, where $y_i \in \{+1, -1\}$ is the class label associated with
example $x_i \in \mathbb{R}^n$. A strong classifier is a linear combination of $M$ weak classifiers

$$H_M(x) = \sum_{m=1}^{M} h_m(x) \qquad (1)$$

In the real version of AdaBoost [5, 6], the weak classifiers can take a real
value, $h_m(x) \in \mathbb{R}$, and have absorbed the coefficients needed in the discrete
version ($h_m(x) \in \{-1, +1\}$ in the latter case). The class label for $x$ is obtained
as $H(x) = \mathrm{sign}[H_M(x)]$, while the magnitude $|H_M(x)|$ indicates the confidence.
Every training example is associated with a weight. During the learning process,
the weights are updated dynamically in such a way that more emphasis is
placed on hard examples which were erroneously classified previously. It is noted
in recent studies [9, 10, 11] that the artificial operation of explicit re-weighting
is unnecessary and can be incorporated into a functional optimization procedure
of boosting.



0. (Input)
   (1) Training examples $\{(x_1, y_1), \ldots, (x_N, y_N)\}$,
       where $N = a + b$; of which $a$ examples have $y_i = +1$
       and $b$ examples have $y_i = -1$;
   (2) The maximum number $M_{\max}$ of weak classifiers to be combined;

1. (Initialization)
   $w_i^{(0)} = \frac{1}{2a}$ for those examples with $y_i = +1$ or
   $w_i^{(0)} = \frac{1}{2b}$ for those examples with $y_i = -1$.
   $M = 0$;

2. (Forward Inclusion)
   while $M < M_{\max}$
   (1) $M \leftarrow M + 1$;
   (2) Choose $h_M$ according to Eq. (4);
   (3) Update $w_i^{(M)} \leftarrow \exp[-y_i H_M(x_i)]$, and normalize to $\sum_i w_i^{(M)} = 1$;

3. (Output)
   $H(x) = \mathrm{sign}[\sum_{m=1}^{M} h_m(x)]$.

Fig. 3. RealBoost Algorithm.
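The loop of Fig. 3 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes each weak classifier is a lookup table over a histogram-quantised scalar feature and that the feature is chosen by the smallest weighted exponential loss, which is a common simplification and differs from the selection procedure of Section 5. The names `real_adaboost`, `strong_score`, `n_bins` and `eps` are hypothetical.

```python
import numpy as np

def real_adaboost(features, labels, max_weak=10, n_bins=8, eps=1e-6):
    """features: (N, K) scalar feature values; labels: (N,) in {+1, -1}."""
    n, k = features.shape
    pos, neg = labels == 1, labels == -1
    # Step 1: class-balanced initial weights 1/(2a) and 1/(2b).
    weights = np.where(pos, 0.5 / pos.sum(), 0.5 / neg.sum())
    # Quantise every feature into n_bins bins (assumed weak-learner form).
    edges = [np.linspace(f.min(), f.max(), n_bins + 1)[1:-1] for f in features.T]
    bins = np.stack([np.digitize(features[:, j], edges[j]) for j in range(k)], axis=1)
    score = np.zeros(n)
    model = []
    for _ in range(max_weak):
        best = None
        for j in range(k):
            w_pos = np.bincount(bins[pos, j], weights[pos], minlength=n_bins)
            w_neg = np.bincount(bins[neg, j], weights[neg], minlength=n_bins)
            # Half log-ratio of the weighted class masses per bin, as in Eq. (4).
            table = 0.5 * np.log((w_pos + eps) / (w_neg + eps))
            loss = np.sum(weights * np.exp(-labels * table[bins[:, j]]))
            if best is None or loss < best[0]:
                best = (loss, j, table)
        _, j, table = best
        model.append((j, edges[j], table))
        score += table[bins[:, j]]
        # Step 2.(3): re-weight from the current strong classifier and normalise.
        weights = np.exp(-labels * score)
        weights /= weights.sum()
    return model

def strong_score(model, x):
    """Confidence H_M(x) for one feature vector x; its sign is the class."""
    return sum(table[np.digitize(x[j], edges)] for j, edges, table in model)
```

The smoothing constant `eps` only guards against empty bins; it is not part of Fig. 3.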



An error occurs when $H(x) \neq y$, or $y H_M(x) < 0$. The “margin” of an example
$(x, y)$ achieved by $h(x) \in \mathbb{R}$ on the training set examples is defined as $y h(x)$.
This can be considered as a measure of the confidence of $h$’s prediction.
The upper bound on the classification error achieved by $H_M$ can be derived as the
following exponential loss function [12]

$$J(H_M) = \sum_i e^{-y_i H_M(x_i)} = \sum_i e^{-y_i \sum_{m=1}^{M} h_m(x_i)} \qquad (2)$$



AdaBoost constructs $h_m(x)$ by stagewise minimization of Eq. (2). Given the current
$H_{M-1}(x) = \sum_{m=1}^{M-1} h_m(x)$, the best $h_M(x)$ for the new strong classifier
$H_M(x) = H_{M-1}(x) + h_M(x)$ is the one which leads to the minimum cost

$$h_M = \arg\min_{h^\dagger} J(H_{M-1}(x) + h^\dagger(x)) \qquad (3)$$



The minimizer is [5, 6]

$$h_M(x) = \frac{1}{2} \log \frac{P(y = +1 \mid x, w^{(M-1)})}{P(y = -1 \mid x, w^{(M-1)})} \qquad (4)$$

where $w^{(M-1)}(x, y) = \exp(-y H_{M-1}(x))$ is the weight for the labeled example
$(x, y)$ and

$$P(y = +1 \mid x, w^{(M-1)}) = \frac{E(w(x, y) \cdot 1_{[y = +1]} \mid x)}{E(w(x, y) \mid x)} \qquad (5)$$

where $E(\cdot)$ stands for the mathematical expectation and $1_{[C]}$ is one if $C$ is true
or zero otherwise. $P(y = -1 \mid x, w^{(M-1)})$ is defined similarly.

The AdaBoost algorithm based on the descriptions from [5, 6] is shown in
Fig. 3. There, the re-weight formula in step 2.(3) is equivalent to the multiplicative
rule in the original form of AdaBoost [13, 5]. In Section 3.2, we will present
a statistical model for stagewise approximation of $P(y = +1 \mid x, w^{(M-1)})$.
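The step from Eq. (3) to Eq. (4) can be filled in as follows. This is the standard pointwise argument for the exponential loss, written in the notation above with $w^{(M-1)}(x, y) = \exp(-y H_{M-1}(x))$; it is a sketch, not the full derivation given in [5, 6].

```latex
\begin{aligned}
E\left[e^{-y(H_{M-1}(x)+h(x))}\mid x\right]
 &= e^{-h(x)}\,E\left[w^{(M-1)}\,1_{[y=+1]}\mid x\right]
  + e^{h(x)}\,E\left[w^{(M-1)}\,1_{[y=-1]}\mid x\right],\\
\text{setting the derivative w.r.t. } h(x) \text{ to zero:}\quad
e^{2h(x)} &= \frac{E[w^{(M-1)}\,1_{[y=+1]}\mid x]}{E[w^{(M-1)}\,1_{[y=-1]}\mid x]}
 = \frac{P(y=+1\mid x,w^{(M-1)})}{P(y=-1\mid x,w^{(M-1)})},\\
h_M(x) &= \frac{1}{2}\log\frac{P(y=+1\mid x,w^{(M-1)})}{P(y=-1\mid x,w^{(M-1)})}.
\end{aligned}
```

The second equality in the middle line uses Eq. (5): the common denominator $E(w \mid x)$ cancels in the ratio.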



4 Construction of Candidate Weak Classifiers 



The optimal weak classifier at stage $M$ is derived as Eq. (4). Using $P(y \mid x, w) \propto
p(x \mid y, w) P(y)$, it can be expressed as

$$h_M(x) = L_M(x) - T \qquad (6)$$

where

$$L_M(x) = \frac{1}{2} \log \frac{p(x \mid y = +1, w)}{p(x \mid y = -1, w)} \qquad (7)$$

$$T = \frac{1}{2} \log \frac{P(y = -1)}{P(y = +1)} \qquad (8)$$



The log likelihood ratio (LLR) $L_M(x)$ is learned from the training examples of
the two classes. The threshold T is determined by the log ratio of prior prob- 
abilities. In practice, T can be adjusted to balance between the detection and 
false alarm rates (i.e. to choose a point on the ROC curve). 

Learning optimal weak classifiers requires modelling the LLR of Eq. (7).
Estimating the likelihood for high dimensional data $x$ is a non-trivial task.
In this work, we make use of the stagewise characteristics of boosting, and
derive the likelihood $p(x \mid y, w^{(M-1)})$ based on an over-complete scalar feature
set $Z = \{z'_1, \ldots, z'_K\}$. More specifically, we approximate $p(x \mid y, w^{(M-1)})$ by
$p(z_1, \ldots, z_{M-1}, z' \mid y, w^{(M-1)})$, where $z_m$ ($m = 1, \ldots, M-1$) are the features that
have already been selected from $Z$ by the previous stages, and $z'$ is the feature
to be selected. The following describes the candidate feature set $Z$, and presents
a method for constructing weak classifiers based on these features.

Because the shape is about boundaries between regions, it makes sense to
use edge information (magnitude or orientation or both) extracted from a
grey-scale image. In this work, we use the simple Sobel filter for extracting the edge
information. Two filters are used: $K_w$ for horizontal edges and $K_h$ for vertical
edges, as follows:

$$K_w(w, h) = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix} \quad \text{and} \quad K_h(w, h) = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix} \qquad (9)$$

The convolution of the image with the two filter masks gives two edge
strength values.

$$G_w(w, h) = K_w * I(w, h) \qquad (10)$$

$$G_h(w, h) = K_h * I(w, h) \qquad (11)$$

The edge magnitude and direction are obtained as:

$$S(w, h) = \sqrt{G_w^2(w, h) + G_h^2(w, h)} \qquad (12)$$

$$\phi(w, h) = \arctan\left(\frac{G_h(w, h)}{G_w(w, h)}\right) \qquad (13)$$

The edge information based on the Sobel operator is sensitive to noise. To solve this
problem, we convolve sub-blocks of the image with the Sobel filter (see Fig. 4),
which is similar to Haar-like feature calculation.

Fig. 4. The two types of simple Sobel-like filters defined on sub-windows. The
rectangles are of size $w \times h$ and are at distances of $(dw, dh)$ apart. Each feature
takes a value calculated by the weighted ($\pm 1$, $\pm 2$) sum of the pixels in the
rectangles.
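The Sobel edge magnitude and direction, and a block-averaged variant in the spirit of Fig. 4, can be sketched as below. The sketch makes several assumptions: the vertical kernel is taken as the transpose of the horizontal one, the filter is applied as a cross-correlation (which only flips the sign of the responses for these kernels), `arctan2` replaces the plain arctangent to avoid division by zero, and `block_sobel_edges` simply averages square cells before filtering. The exact rectangle layout with spacings $(dw, dh)$ is not reproduced, and all function names are hypothetical.

```python
import numpy as np

K_W = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)  # horizontal edges
K_H = K_W.T                                                        # vertical edges (assumed)

def filter3x3(image, kernel):
    """'Valid' 3x3 cross-correlation (no kernel flip) with plain NumPy slicing."""
    rows, cols = image.shape[0] - 2, image.shape[1] - 2
    out = np.zeros((rows, cols))
    for r in range(3):
        for c in range(3):
            out += kernel[r, c] * image[r:r + rows, c:c + cols]
    return out

def sobel_edges(image):
    """Edge magnitude and direction in the sense of the magnitude/direction formulas."""
    g_w = filter3x3(image, K_W)
    g_h = filter3x3(image, K_H)
    magnitude = np.hypot(g_w, g_h)
    direction = np.arctan2(g_h, g_w)  # full-circle angle, safe when g_w == 0
    return magnitude, direction

def block_sobel_edges(image, block=2):
    """Assumed noise-reduction variant: average block x block cells, then filter."""
    h = (image.shape[0] // block) * block
    w = (image.shape[1] // block) * block
    cells = image[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))
    return sobel_edges(cells)
```

Averaging over cells before filtering trades spatial resolution for robustness to pixel noise, which is the motivation stated in the text.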



5 Statistical Learning of Weak Classifiers 

A scalar feature $z'_k : x \to \mathbb{R}$ is a transform from the $n$-dimensional (400-D
if a face example $x$ is of size 20x20) data space to the real line. These block
differences are an extension to the Sobel filters. For each face example of size
20x20, there are hundreds of thousands of different $z'_k$ for admissible $w, h, dw, dh$
values, so $Z$ is an over-complete feature set for the intrinsically low-dimensional
face pattern $x$. In this work, an optimal weak classifier (6) is associated with a
single scalar feature; to find the best new weak classifier is to choose the best
corresponding feature.

We can define the following component LLRs for the target $L_M(x)$:

$$\tilde{L}_m(x) = \frac{1}{2} \log \frac{p(z_m(x) \mid y = +1, w^{(m-1)})}{p(z_m(x) \mid y = -1, w^{(m-1)})} \qquad (14)$$

for the selected features $z_m$ ($m = 1, \ldots, M-1$), and

$$L_k^{(M)}(x) = \frac{1}{2} \log \frac{p(z'_k(x) \mid y = +1, w^{(M-1)})}{p(z'_k(x) \mid y = -1, w^{(M-1)})} \qquad (15)$$

for features to be selected, $z'_k \in Z$. Then, after some mathematical derivation,
we can approximate the target LLR function as

$$L_M(x) = \frac{1}{2} \log \frac{p(x \mid y = +1, w^{(M-1)})}{p(x \mid y = -1, w^{(M-1)})} \approx \sum_{m=1}^{M-1} \tilde{L}_m(x) + L_k^{(M)}(x) \qquad (16)$$

Let

$$\Delta L_M(x) = L_M(x) - \sum_{m=1}^{M-1} \tilde{L}_m(x) \qquad (17)$$

The best feature is the one whose corresponding $L_k^{(M)}(x)$ best fits $\Delta L_M(x)$. It
can be found as the solution to the following minimization problem

$$k^* = \arg\min_{k, \beta} \sum_{i=1}^{N} \left[ \Delta L_M(x_i) - \beta L_k^{(M)}(x_i) \right]^2 \qquad (18)$$

This can be done in two steps as follows: First, find $k^*$ for which

$$(L_k^{(M)}(x_1), L_k^{(M)}(x_2), \ldots, L_k^{(M)}(x_N)) \qquad (19)$$

is most parallel to

$$(\Delta L_M(x_1), \Delta L_M(x_2), \ldots, \Delta L_M(x_N)) \qquad (20)$$

This amounts to finding $k$ for which $L_k^{(M)}$ is most correlated with $\Delta L_M$ over the
data distribution, and setting $z_M = z'_{k^*}$. Then, we compute

$$\beta^* = \frac{\sum_{i=1}^{N} \Delta L_M(x_i) L_{k^*}^{(M)}(x_i)}{\sum_{i=1}^{N} [L_{k^*}^{(M)}(x_i)]^2} \qquad (21)$$

After that, we obtain

$$\tilde{L}_M(x) = \beta^* L_{k^*}^{(M)}(x) \qquad (22)$$

The strong classifier is then given as

$$H_M(x) = \sum_{m=1}^{M} \left( \tilde{L}_m(x) - T \right) = \sum_{m=1}^{M} \tilde{L}_m(x) - MT \qquad (23)$$

The evaluation function $H_M(x)$ thus learned gives a quantitative confidence,
and the good-bad classification is achieved by comparing the confidence with the
threshold value of 0 (zero).
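The two-step feature selection of this section (pick the candidate most parallel to the residual, then fit a least-squares scale) can be sketched as follows. The function name `select_feature` and the array layout are assumptions made for illustration; the candidate LLR values are taken as already evaluated on the $N$ training examples.

```python
import numpy as np

def select_feature(delta_l, candidate_llr):
    """delta_l: (N,) residual values; candidate_llr: (K, N), one row per candidate."""
    # Step 1: cosine of the angle between each candidate row and the residual.
    norms = np.linalg.norm(candidate_llr, axis=1) * np.linalg.norm(delta_l)
    cosine = (candidate_llr @ delta_l) / np.maximum(norms, 1e-12)
    k_star = int(np.argmax(np.abs(cosine)))
    # Step 2: least-squares scale for the chosen candidate.
    best = candidate_llr[k_star]
    beta_star = float(best @ delta_l / (best @ best))
    return k_star, beta_star
```

For a fixed candidate, the optimal scale leaves a squared residual proportional to one minus the squared cosine, so maximising the absolute cosine in step 1 and then solving for the scale in step 2 gives the same result as minimising over both jointly.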

There are two important distinctions between an evaluation function thus
learned and the linear evaluation function of reconstruction error used in AAM.
First, the evaluation function is learned in such a way as to distinguish between good and
bad alignment. Secondly, the scoring is nonlinear, which provides a semantically
more meaningful classification between good and bad alignment.



6 Experimental Results 

The positive and negative training examples are generated as follows: All the
shapes are aligned, or warped, to the tangent space of the mean shape $\bar{S}$. After
that, the texture $T_0$ is warped correspondingly to $T \in \mathbb{R}^L$, where $L$ is the number
of pixels in the mean shape $\bar{S}$.

In our work, 2536 positive examples and 3000 negative examples are used
to train a strong classifier. The 2536 positive examples are derived from 1268
original positive examples plus their mirror images. The negative examples are
generated by randomly rotating, scaling, and shifting the shape points of positive examples.
A strong classifier is trained to reject 92% of the negative examples, while correctly
accepting 100% of the positive examples.
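Generating a negative example by perturbing the landmarks of a positive one can be sketched as below. The perturbation ranges (`max_angle`, `max_log_scale`, `max_shift`) are hypothetical placeholders, since the text does not state the ranges used, and rotating about the shape centroid is likewise an assumption.

```python
import numpy as np

def perturb_shape(points, rng, max_angle=0.3, max_log_scale=0.2, max_shift=3.0):
    """Randomly rotate, scale and shift a (K, 2) landmark array about its centroid."""
    angle = rng.uniform(-max_angle, max_angle)              # radians
    scale = np.exp(rng.uniform(-max_log_scale, max_log_scale))
    shift = rng.uniform(-max_shift, max_shift, size=2)      # pixels
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle), np.cos(angle)]])
    centroid = points.mean(axis=0)
    return (points - centroid) @ rotation.T * scale + centroid + shift
```

The texture would then be re-warped from the perturbed shape to the mean shape to obtain the un-qualified training image.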

A cascade of classifiers is trained to obtain a computationally efficient model,
which also makes training easier through a divide-and-conquer strategy. When training a new stage,
negative examples are bootstrapped based on the classifier trained in the previous
stages. The details of training a cascade of 5 stages are summarized in Table
1. As the result of training, we achieved 100% correct acceptance and correct
rejection rates on the training set.
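The cascade evaluation and the bootstrapping of negatives can be sketched in plain Python. Representing each stage as a `(score_fn, threshold)` pair is an assumption for illustration, and both function names are hypothetical.

```python
def cascade_accepts(stages, x):
    """stages: list of (score_fn, threshold); reject at the first failed stage."""
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:
            return False  # early rejection: later stages are skipped
    return True

def bootstrap_negatives(stages, negatives):
    """Keep only the negatives that the current stages still wrongly accept."""
    return [x for x in negatives if cascade_accepts(stages, x)]
```

Because each new stage is trained only on negatives that survive the earlier stages, the negative sets shrink from stage to stage, as in Table 1.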



Table 1. Training results (WC: weak classifier)

stage   number of pos   number of neg   number of WC   False Alarm
1       2536            3000            22             0.076
2       2536            3000            237            0.069
3       2536            888             293            0.203
4       2536            235             263            0.309
5       2536            96              208            0.0



We compare the proposed AdaBoost learning based method with the PCA
texture reconstruction error based evaluation method, using the same data sets
(but PCA does not need negative examples in the training). The dimensionality
of the PCA subspace is chosen to retain 99% of the total variance of the data. The
best threshold of reconstruction error is selected to minimize the classification
error. Fig. 5 shows the ROC curve for the reconstruction error based alignment
evaluation method for the training set. Note that this method cannot achieve
100% correct rates.

Fig. 5. Correct rate curve for the reconstruction error based alignment evaluation
for the training set.




0.431 0.662 0.871 -0.510 0.432 -0.430 

1.092 1.243 -1.225 4.472 -1.775 -1.628 




0.243 -1.738 -1.350 -3.190 2.935 0.568 

Fig. 6. Alignment quality evaluation results: qualified (top part) and un-qualified
(bottom part) alignment examples.

Fig. 7. Comparison between reconstruction error method and boost method.



During the test, a total of 1528 aligned examples (800 qualified images and
728 un-qualified images), which are not seen during the training, are used. We
evaluate each face image and give a score in terms of (a) the confidence value
$H_M(x)$ for the learning based method and (b) the confidence value $\mathit{threshold} -
\mathit{dist}_{PCA}$ for the PCA based method. The qualified and un-qualified alignment
decision is made by comparing the score with the normalized threshold of 0.
Some examples of qualified (the top part) and un-qualified (the bottom part)
face alignment results are shown in Fig. 6, with the corresponding scores (the first
line of the numbers is for the proposed method, and the second line for the
PCA based method). This qualitatively demonstrates the better sensitivity of the
proposed method for alignment evaluation.

Fig. 7 quantitatively compares the two methods in terms of their ROC curves
(first plot) and error curves (the second plot), where the axis label P(pos/neg)
means the false positive rate and so on. From the error curves, we can see that
the equal error rate of the proposed method is about 40%, while that of the
reconstruction error based method is 48%. The proposed approach provides a more
effective method to distinguish between qualified and un-qualified face alignment
than the reconstruction error used in AAM.
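For reference, an equal error rate of the kind quoted above can be estimated from two lists of scores as sketched below. This is a simple threshold scan over the observed scores, assuming that a score at or above the threshold means "qualified"; the function names are hypothetical and the sketch does not reproduce the curves of Fig. 7.

```python
def error_rates(pos_scores, neg_scores, threshold):
    """False negative and false positive rates when accepting score >= threshold."""
    fnr = sum(s < threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    return fnr, fpr

def equal_error_rate(pos_scores, neg_scores):
    """Scan observed scores for the threshold where FNR and FPR are closest."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        fnr, fpr = error_rates(pos_scores, neg_scores, t)
        if best is None or abs(fnr - fpr) < best[0]:
            best = (abs(fnr - fpr), (fnr + fpr) / 2, t)
    return best[1], best[2]
```

The function returns the estimated equal error rate together with the threshold at which it is reached.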

Lastly, we would like to comment on the choice of image features
for constructing weak classifiers: experimentally, we also found that the Sobel
features produced significantly better results than other features such as Haar
wavelets. This is not elaborated here.

7 Conclusion and Future Work 

In this paper, we proposed a statistical learning approach for constructing an
effective evaluation function for face alignment. A set of candidate weak classifiers
is created based on edge features extracted using Sobel-like operators.
Experimental results demonstrate that the classification function learned using
the proposed approach provides semantically more meaningful scoring than the
reconstruction error used in AAM for classification between qualified and
un-qualified face alignment. Although the number of possible negative examples (un-qualified
alignment) is huge, so far we have used only about 40,000+ of them, together with 2536 positive
examples. This training set is still small; so although we easily achieved 100%
training accuracy, the test performance is significantly lower. We expect a better
trained nonlinear quality evaluation function when a larger training set which
covers larger variation is used.



Abstract. In this work we present the preliminary results of a face detection
system based on a hybrid approach: it combines typical feature-based
techniques with image-based analysis, in order to better exploit the main
characteristics available in the input image. Different modules contribute to the
face detection task: 1) a template-based approach initially proposed in [12], 2)
an edge-extraction technique well suited to deal with illumination changes, 3) a
multiple classifier specifically designed to discard false positives and 4) a novel
method based on a featureless representation of the eye patterns that further
improves the face/non-face discrimination. The experimental results show that
the system can localize faces in images with complex background, even in the
presence of strong illumination changes.



1. Introduction 

The problem of face detection can be defined as follows: given a still image or a 
video, detect and localize an unknown number of faces. The solution to the problem 
involves segmentation, extraction, and verification of faces and facial features from 
an uncontrolled background. 

Automatic face location is a very important task, which constitutes the first step of
a wide range of applications: face recognition, face retrieval by similarity, face
tracking, surveillance, etc. [1], [18], [7]. In the opinion of many researchers, face
location is the most critical step towards the development of practical face-based
biometric systems, since its accuracy and efficiency have a direct impact on the
system usability. Several factors contribute to making this task very complex, especially
in the case of applications that must operate in real time on gray-scale static
images. The challenges associated with face detection can be attributed to the
following factors:

• pose changes: face images vary for different rotations around the camera’s 
optical axis; 

• facial expressions; 

• image conditions: lighting and camera characteristics could affect the appearance 
of a face; 

• complex background. 

Many face-location approaches have been proposed in the literature, depending on
the type of images (gray-scale images, color images or image sequences) and on the
constraints considered (simple or complex background, scale and rotation changes,
different illuminations, etc.).

Face detection techniques can be organized in two broad categories, distinguished
by their different approaches to exploiting face knowledge: feature-based and
image-based. The techniques in the first category make explicit use of face knowledge and
follow the classical detection methodology in which low level features are derived.
Properties of the face such as skin color and face geometry are exploited at different
system levels. These techniques have been studied since the 1970s and many works in the
literature refer to such approaches. The techniques in the second class address face
detection as a general recognition problem. Image-based representations of faces, for
example in 2D intensity arrays, are directly classified into a face group using
approaches that incorporate face knowledge implicitly into the system through
mapping and training schemes.

In this work a hybrid approach is presented: it adopts typical feature-based 
techniques (template matching) in the first step and an image-based analysis in the 
second step, so that all the characteristics of the image can be exploited for face 
detection. At present the method can deal with images containing a single upright 
near-frontal face; as a future work we will extend it to more complex images. 

The rest of the paper is organized as follows: in section 2 an overview of the 
system is given, in section 3 the single system components are detailed, in section 4 
the new pattern representation technique is presented, in section 5 the results of the 
experiments are discussed and in section 6 we draw some conclusions. 



2. System Overview

The system is based on a cascade architecture: it consists of four steps (Fig. 1): 

1. template matching: the first step consists in the application of the face detection
algorithm presented in [12], slightly modified in some aspects, as described in
subsection 3.1;

2. edge detection and template matching: the input image is transformed into the 
frequency domain (by calculating phase congruency [10]) in order to overcome 
problems due to illumination. The first step is then reapplied to the transformed 
image. The details of this procedure are reported in subsection 3.2; 

3. false positives elimination: the candidate face images identified at the previous 
steps are analyzed and selected. A cascade of three simple classifiers is adopted 
to discard non-face images (subsection 3.3); 

4. analysis of eyes regions: an image-based analysis is carried out in order to 
identify candidate eyes and, starting from them, candidate face images. A sub- 
image centered in the supposed eye position is extracted from the original image 
and classified by a pool of six classifiers, each trained to discriminate between 
“face” and “non-face” patterns. If the final similarity score is higher than a fixed 
threshold, a face is considered detected. The procedure of analysis of eye regions 
is described in subsection 3.4. 

Not all the steps necessarily have to be performed: as soon as a face is
detected with a sufficient degree of confidence (determined on the basis of a
similarity threshold), the remaining steps are not executed.

Fig. 1. Description of the steps of the proposed approach.

Each module contributes to strengthening the face detector, even in the presence of
challenging acquisition conditions that can cause the failure of the method presented
in [12]. Developing a robust system required modifying the method in [12] so that it
retains all the candidate face images. Obviously this approach introduces a high
number of false positives that have to be filtered in a dedicated step. The edge
detection step can help to deal with limited illumination problems, since phase congruency
is a measure invariant to changes in image brightness and contrast. In spite of such
modifications, the template matching based approach can still fail in the presence of
strong illumination changes, when the ellipse representing the face cannot be easily
identified. For this reason, the method has been enriched with an image-based analysis
consisting of two stages: eye detection and face detection from eye regions. In the first
stage, the candidate eyes are identified. Though the eyes are the easiest facial
feature to identify, the classification of eye images is not a simple task since they
are characterized by high variability (e.g. closed eyes, presence of glasses, etc.)
and many feature extraction algorithms are ineffective in obtaining a meaningful
representation. The main element of novelty of this work is the introduction of a new
“featureless” representation of the patterns (eye images): each pattern is represented
by its dissimilarity from the other patterns instead of by some characteristic features.
The description of the new representation is quite complex and is detailed separately in
section 4. The adoption of such a representation noticeably reduces the
classification error, as reported in the experimental results. The identification of
candidate eyes limits the search area of the face in the second stage,
giving a lower computational complexity than methods based on
sliding windows, which must scan the whole image. Finally, the procedure of face
detection from the eye regions employs a pool of classifiers that have been selected
from a larger set of candidates on the basis of an error independence analysis carried
out on a validation set. The final choice of the classifiers has been made using
the “disagreement criterion” [11].
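One common way to quantify how independently two classifiers err is the pairwise disagreement measure: the fraction of validation samples on which exactly one of the two is correct. The sketch below uses that measure as an assumption; the precise criterion of [11] and the search strategy actually used are not given here, and the function names are hypothetical.

```python
from itertools import combinations

def disagreement(correct_a, correct_b):
    """Fraction of validation samples on which exactly one classifier is right."""
    differing = sum(a != b for a, b in zip(correct_a, correct_b))
    return differing / len(correct_a)

def most_diverse_pool(correct, pool_size):
    """correct: dict name -> list of booleans (per-sample correctness).

    Exhaustively pick the subset of pool_size >= 2 classifiers with the
    highest mean pairwise disagreement.
    """
    def mean_disagreement(names):
        pairs = list(combinations(names, 2))
        return sum(disagreement(correct[a], correct[b]) for a, b in pairs) / len(pairs)
    return max(combinations(sorted(correct), pool_size), key=mean_disagreement)
```

Exhaustive search is only practical for small candidate sets; a greedy selection would be the usual substitute for larger ones.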



3. Description of the System Modules 

3.1 Template Matching 

The algorithm proposed in [12] starts by approximately detecting the image positions 
where the probability to find a face is high and then, for each of them, improves the 
location accuracy and verifies the presence of a real face. Assuming that, when a face 
is present in an image, the corresponding directional image region is characterized by 
vectors producing an elliptical blob, in order to identify candidate positions the 
authors adopt an approach based on the generalized Hough transform. Starting from 
these candidate positions, the face is locally searched in a small portion of the 
directional image by means of a mask defined in terms of directional elements, 
describing the global aspect of a human face. 

The algorithm, as originally proposed, has some limitations: the range of face sizes
it can deal with is limited, and it is sensitive to particular illumination
conditions. In an attempt to strengthen this method, slight modifications have been
introduced:

■ a higher number of face templates has been adopted, to account for the high size 
variability that characterizes the face images; 

■ the method in [12] retains only the sub-image that gains the highest similarity score
with the stored face templates. If the resulting score is higher than a prefixed
threshold, a face is considered detected; otherwise no face is assumed to
be present in the image. Some experiments show that, in particular cases (mainly
in the presence of challenging illumination conditions), the true face is among the
sub-images that obtain a lower similarity score and are erroneously discarded. For this
reason we analyze all the other candidate face images as well. Obviously this choice
introduces a high number of false positives, making it necessary to adopt a
further filtering step (described in section 3.3).

3.2 Edge Detection and Template Matching 

This step helps to deal with challenging illumination conditions that could affect the
detection. The algorithm adopted is based on the representation of the image in the
frequency domain, which makes it possible to mark the features present in the image, since
image features such as lines and edges correspond to points where the Fourier
components are maximally in phase. Starting from this observation, the calculation of
phase congruency was proposed in [13] as a technique for the extraction of image
features. Phase congruency is a quantity invariant to changes in image brightness or
contrast, providing an absolute measure of the significance of feature points. In this
work the algorithm proposed in [10] is adopted; in [10] the authors show that phase
congruency in 1D signals can be calculated from the convolution outputs of a bank of
Gabor filters, and they extend the concept to 2D images. Once the image has been
transformed, the first step is reapplied.



3.3 False Positives Elimination 

The first two steps presented above can create a high number of false positives,
particularly in images with a complex background. For this reason we introduce a
false positives elimination step in which the candidate face images are analyzed and
selected. A cascade of three simple classifiers is adopted to discard non-face images.
Each classifier can confirm the presence of a face or reject the input image if it
does not pass the related check, on the basis of a similarity threshold thr. The
first two classifiers are based on two simple features presented in [16], which
reduce the number of false positives at a low computational cost.




Fig. 2. The two features proposed in [16] and adopted in this work. 



They simply consist in the verification of a particular distribution of grey levels in
the image (see Fig. 2) that indicates the presence of a possible face. Finally, a QDC
classifier [2] is adopted to classify the input image as a “face” or “non-face” pattern. 
Before classification, a feature vector of low dimensionality (10 in the experiments) is 
extracted from the gray level values of the original image by applying the KL 
transform [5] in order to extract the salient image features and reduce the presence of 
noise. 
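The combination of a KL (PCA) projection with a quadratic discriminant classifier can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: QDC is taken to mean Gaussian class models with separate covariance matrices, a small ridge term is added to keep the covariances invertible, and the names `kl_transform` and `QuadraticClassifier` are hypothetical.

```python
import numpy as np

def kl_transform(samples, n_features=10):
    """Return (mean, basis) so that basis @ (x - mean) gives the KL coefficients."""
    mean = samples.mean(axis=0)
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_features]

class QuadraticClassifier:
    """Gaussian class models with separate covariances (a basic QDC)."""

    def fit(self, x, y):
        self.params = {}
        for label in np.unique(y):
            rows = x[y == label]
            cov = np.cov(rows, rowvar=False) + 1e-6 * np.eye(x.shape[1])
            self.params[label] = (rows.mean(axis=0), np.linalg.inv(cov),
                                  np.linalg.slogdet(cov)[1],
                                  np.log(len(rows) / len(x)))
        return self

    def predict(self, x):
        labels = list(self.params)
        scores = []
        for mean, inv_cov, log_det, log_prior in self.params.values():
            d = x - mean
            maha = np.einsum("ij,jk,ik->i", d, inv_cov, d)  # squared Mahalanobis distance
            scores.append(-0.5 * (maha + log_det) + log_prior)
        return np.array(labels)[np.argmax(scores, axis=0)]
```

In use, grey-level vectors would be projected with `(data - mean) @ basis.T` before being passed to `fit` and `predict`.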



3.4 Analysis of the Eye Regions 

An analysis of the eye regions is performed in the last stage of the proposed
approach: the input image is binarized and the clusters of pixels representing potential
eyes are identified; then each cluster is classified as an “eye” or “non-eye” pattern
and, starting from the eye clusters, a set of candidate face images is extracted from
the original image and classified by a pool of classifiers as “face” or “non-face”
patterns. In order to reduce the computational complexity of the method, the candidate
eyes are not searched for over the whole image; instead, the map of the Hough accumulator
determined at the first step is used to restrict the search area to the regions
presenting values higher than a prefixed threshold.

In order to detect faces at different scales, the method requires the definition of the
minimum face height ($h_{face}^{min}$) and width ($w_{face}^{min}$) expected and the number $n$ of scales
to be analyzed. The height ($h_{eye}$) and width ($w_{eye}$) of the eye are calculated as:

$$h_{eye} = 0.2 \cdot h_{face} \qquad w_{eye} = 0.25 \cdot w_{face}$$

The procedure of eye detection will be detailed in subsection 3.4.1, and in 
subsection 3.4.2 the approach of face detection from the eye regions is described. 

3.4.1 Eye Detection 

The aim of this stage is to detect the presence of eye patterns in the image. We search
for a single eye instead of a pair of eyes since, under particular
illumination conditions, one of the eyes might not be visible.

The input for this step is the input image $I$ and the grey level image $H$ representing
the values of the Hough accumulator determined in the first step of the method;
brighter intensities represent higher values of the Hough accumulator, whose values
have been normalized between 0 and 256. The image $H$ is binarized, setting to 1 all
the pixels having a grey level value greater than a prefixed threshold $th_1$ and setting to
0 the others. The binarized image is then used to filter and binarize the image $I$,
according to the following formula:

$$I'(i, j) = \begin{cases} 1 & \text{if } H(i, j) = 1 \text{ and } I(i, j) > th_2 \\ 0 & \text{otherwise} \end{cases}$$

where $th_2$ is the threshold used for binarization.
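The two-threshold filtering above amounts to a single masked comparison. The sketch below assumes both images are NumPy arrays of the same shape; the function name `binarize_eye_candidates` is hypothetical.

```python
import numpy as np

def binarize_eye_candidates(image, hough, th1, th2):
    """Binary map: 1 where the Hough mask is on and the pixel exceeds th2."""
    hough_mask = hough > th1  # binarised accumulator image H
    return (hough_mask & (image > th2)).astype(np.uint8)
```

Connected groups of 1-pixels in the returned map are the clusters examined in the next step.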

The result is a new binarized image where a set of clusters (sets of connected
pixels having the same value) can be identified. Fig. 3 shows the input image and the
images obtained at different filtering stages.

Fig. 3. The input image (a) and the images obtained at different filtering stages: Hough
accumulator $H$ (b), filtered original image (c) and its binarization (d).

The candidate eye clusters are selected by applying some simple heuristic criteria:

• each cluster must contain between 10 and 400 pixels (too big and too small 
clusters are discarded); 

• clusters having a ratio between height and width greater than 2 are discarded; 

• there must be no more than 5 other clusters in a range of 25 pixels (the number of 
false positives due to complex backgrounds can be reduced). 
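The three heuristic criteria can be sketched as a filter over pixel clusters. Several details are assumptions made for illustration: clusters are given as lists of `(row, col)` pixels, height and width are taken from the bounding box, and "in a range of 25 pixels" is interpreted as the distance between cluster centres, counted over all clusters. The function name is hypothetical.

```python
import math

def select_eye_clusters(clusters, min_pixels=10, max_pixels=400,
                        max_ratio=2.0, max_neighbours=5, radius=25.0):
    """clusters: list of lists of (row, col) pixels; returns the kept clusters."""
    def centre(cluster):
        return (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))

    centres = [centre(c) for c in clusters]
    kept = []
    for i, cluster in enumerate(clusters):
        rows = [p[0] for p in cluster]
        cols = [p[1] for p in cluster]
        height = max(rows) - min(rows) + 1
        width = max(cols) - min(cols) + 1
        # Count other clusters whose centre lies within the given radius.
        neighbours = sum(1 for j, c in enumerate(centres)
                         if j != i and math.dist(c, centres[i]) <= radius)
        if (min_pixels <= len(cluster) <= max_pixels
                and height / width <= max_ratio
                and neighbours <= max_neighbours):
            kept.append(cluster)
    return kept
```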

A rectangle of size $h_{eye}$ by $w_{eye}$, aligned with the center of each cluster, is extracted
as a candidate image of an eye. A feature vector is extracted from the resulting image 
by applying the Karhunen-Loeve transform (KL) [5], and an additional filtering step 
is performed by classifying the clusters as “eye” or “non-eye” patterns by means of a 
QDC classifier [2]. The dimensionality of the feature vector has been fixed to 4 in the 
experiments; the same results have been obtained with higher values. 

If the similarity with respect to the class “eye” is sufficiently high, the cluster is
subjected to a final selection step. Some preliminary experiments showed that the
intrinsic dimensionality of the images representing single eyes is very low; moreover
we experimentally verified that, in such a reduced subspace, the patterns are not
sufficiently scattered, making the classification task very difficult. For these reasons
we introduce a space transformation, by adopting a featureless representation of the
patterns. This transformation is the most original contribution of this work and
will be detailed in section 4. The patterns in the new space are more scattered,
resulting in an easier classification task. The “transformed patterns” are then
classified by a simple 1-NN algorithm as “eye” or “non-eye” patterns.

3.4.2 Face Detection from Eye Regions 

Starting from each eye candidate position, six subimages are extracted from the 
original image: 

■ n different scales are considered, so that faces can be detected even when their 
dimensions vary widely within an input image. The initial dimension is defined by 
two parameters $h_{face}$ and $w_{face}$ (height and width of the face respectively); the 
other n - 1 images are obtained by increasing the scale by a factor of 1.25 each time 
(Fig. 4 shows an example of three images at different scales). In the experiments the 
parameters were fixed to $h_{face}$ = 95, $w_{face}$ = 58, n = 3. 

■ for each of the n scales, two subimages are selected supposing that the eye cluster 
is the left eye of the face or the right eye respectively (Fig. 5). For the extraction 
of the face image the same ratio between eyes and face assumed in [12] is 
adopted. 

The six rectangles obtained represent candidate face images and are successively 
classified by a pool of six classifiers: 

■ two simple classifiers based on some of the features proposed in [16] and adopted 
in the step of false positives elimination (see section 3.3); 

■ two simple classifiers, QDC and LDC [2]; 

■ two more complex classifiers based on Support Vector Machines [15], adopting 
a polynomial and a Radial Basis Function kernel respectively. 

Fig. 4. Three images at different scales extracted from the same image. The analysis at different 
scales makes it possible to detect faces even when their dimensions vary widely within an 
input image. 







Fig. 5. Extraction of the candidate face images on the basis of the eye clusters identified by the 
eye detector. Two subimages for each of the three scales considered are extracted, one 
supposing that the eye is “right” and the other supposing that the eye is “left”. 

The last four classifiers analyze a feature vector obtained from the candidate face 
image by applying a KL transform [5]. Note that the most complex classifiers 
(the SVMs, at the lower levels of the pool) are applied to a very limited set of 
images, since the first simple classifiers in the cascade are able to eliminate most of 
the false positives. 

Each classifier is trained to distinguish between “face” and “non-face” patterns. If 
the similarity of the pattern to the class “face” is higher than a prefixed threshold, the 
image is passed as input to the next classifier. If all the classifiers give a positive 
result, a face is considered detected by the system. 
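
The cascade logic described above can be illustrated with a few lines of Python. This is 
a minimal sketch, not the authors' implementation: the function name and the 
representation of each stage as a (scoring function, threshold) pair are assumptions. 
Stages are expected in order of increasing cost, so that the expensive classifiers are 
evaluated only on candidates that survive the simple ones. 

```python
def cascade_accepts(pattern, stages):
    """Return True only if every stage scores the pattern above its threshold.

    `stages` is a list of (score_fn, threshold) pairs, cheapest first.
    Evaluation stops at the first stage that rejects the pattern, so later
    (more expensive) stages are never run on early rejects.
    """
    for score_fn, threshold in stages:
        # The pattern goes on to the next stage only if its similarity to
        # the class "face" is higher than the prefixed threshold.
        if score_fn(pattern) <= threshold:
            return False
    return True
```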

4. Featureless Representation of the Patterns 

A new featureless representation of the patterns is proposed: each pattern is 
represented by its dissimilarity from the other patterns. This strategy has been 
considered as an appealing alternative to feature-based representation in recent works, 
e.g. [14] [4] [3], since it can give good results when a feasible feature-based 
description of objects is difficult to obtain or inefficient for learning purposes. 

For each pattern, a mapping $\Phi$ is calculated from the pattern space $\Re^n$ to a new 
q-dimensional space. Each component $\Phi_i$ of the mapping function can be viewed as a 
distance function d from a pattern $x \in \Re^n$ to the decision surface of the region $\Psi_i$: 

$\Phi(x) = (d(x, \Psi_1), d(x, \Psi_2), \ldots, d(x, \Psi_q))$ 

where $\Psi = \{\Psi_1, \Psi_2, \ldots, \Psi_q\}$ is a set of bi-dimensional regions, where the two 
dimensions are extracted from the original feature space. The procedure for the 
definition of such regions will be detailed in the following. 

Our method employs fuzzy rules [9] in order to choose, among all the possible 
combinations, a set of bi-dimensional regions of interest. Given a set X of 
n-dimensional vectors, each rule $R_j$ has the following structure: 

Rule $R_j$: IF $x_1$ IS $A_{j1}$ AND . . . AND $x_n$ IS $A_{jn}$ THEN $C_j$ IS $y$, with $x \in X$ 

where: 

• $A_{ji}$, i = 1, ..., n, is a linguistic variable in the set $\{v_1, v_2, \ldots, v_s\} \cup \{\text{“don't care”}\}$ (the 
value “don’t care” for a variable means that the consequent class does not depend 
on the value of that particular variable); 

• $C_j \in \{1, \ldots, nc\}$ is the consequent class (in this work $C_j \in \{1, 2\}$, where 1 is the class 
“eye” and 2 represents “non-eye”); 

• y is the compatibility grade calculated as $\mu_j(x) = \mu_{j1}(x_1) \cdot \ldots \cdot \mu_{jn}(x_n)$, where 
$\mu_{ji}(\cdot)$ is the membership function of the antecedent linguistic value $A_{ji}$, 
calculated as in [9]. 
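
As an illustration of the compatibility grade, the product of memberships can be 
computed as in the following Python sketch. The membership functions of [9] are not 
reproduced in the text, so the sketch assumes evenly spaced triangular membership 
functions on features normalized to [0, 1]; this choice, the function names and the use 
of `None` to encode “don’t care” are all assumptions made for the example. 

```python
DONT_CARE = None  # hypothetical encoding of the "don't care" value


def membership(x, value_index, s=4):
    """Membership of x (assumed in [0, 1]) in linguistic value `value_index`.

    Assumes s evenly spaced triangular fuzzy sets (indices 0 .. s-1);
    a "don't care" antecedent is fully compatible with any input.
    """
    if value_index is DONT_CARE:
        return 1.0
    centre = value_index / (s - 1)
    width = 1.0 / (s - 1)
    return max(0.0, 1.0 - abs(x - centre) / width)


def compatibility_grade(x, antecedent, s=4):
    """Product of the memberships of each feature in its antecedent value."""
    grade = 1.0
    for x_i, a_i in zip(x, antecedent):
        grade *= membership(x_i, a_i, s)
    return grade
```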

The algorithm for the derivation of the new representation can be summarized in 
the following steps: 

1. The complete set of fuzzy rules is created with all the possible combinations of 
values of the s linguistic variables (s = 4 in the experiments), for the n features of 
the patterns. 

2. The rules are sorted according to their “goodness” and only the first k are 
retained. For each rule $R_j$ the patterns of the training set are ranked on the basis 
of their compatibility grade given by $R_j$. The sequence of patterns is split into two 
runs: the split position is given by the number of training samples belonging to 
the consequent class $C_j$. The patterns of the first run actually belonging to $C_j$ and the 
patterns of the second run not belonging to $C_j$ are retained in a set $S_j$. The 
cardinality of $S_j$ quantifies the “goodness” of the rule $R_j$. 

3. For each of the k retained rules $R_j$, some two-dimensional datasets are created by 
projecting the elements of $S_j$ onto two selected dimensions (we use each pair 
$(i_1, i_2)$ of linguistic variables in the rule with a value different from “don’t care”). 
Each of these new two-dimensional datasets represents a classification problem. 
We indicate with q ($q \le k \times n \times (n - 1) / 2$) the total number of classification 
problems. 

4. For each classification problem, the two-dimensional dataset is adopted as 
the training set for a Radial Basis Function [6] classifier. The output of the classifier 
is a decision surface $\Psi_p$ associated with the classification problem. These regions 
represent the q “candidate” axes of the new featureless space. 

5. For each pattern x in the training set, calculate the mapping from the pattern 
space to the new q-dimensional space. We define the distance $d(x, \Psi_p)$ of a 
pattern x (projected into the region space) from a class region $\Psi_p$ as the length 
of the segment classified as belonging to the class (Fig. 6 gives a graphical 
representation of this distance). The segment is perpendicular to the 
i-axis and the value on the i-axis is the value of the i-th feature of x. 

6. Select only the best d (d is a parameter) dimensions by using a simple feature 
ranking technique which selects the features that maximize the distance between 
the centroids of the different classes of patterns. 

7. 




Fig. 6. A graphical representation of the distance $d(x, \Psi_p)$ of a pattern x (projected into the 
region space) from a class region $\Psi_p$. 



5. Experimental Results 

The experiments were performed on two databases: 

■ Yale B [17]: the database contains 5760 single light source images of 10 subjects 
each seen under 576 viewing conditions (9 poses x 64 illumination 
conditions). For every subject in a particular pose, an image with ambient 
(background) illumination is also captured. We selected the frontal and near- 
frontal images from this database (1080 images). Some example images are 
reported in Fig. 7. 

■ BioID [8]: it consists of 1521 images with human faces, recorded under natural 
conditions, i.e. varying illumination and complex background (Fig. 8). 

Fig. 7. Some example images from the YaleB database. 




Fig. 8. Some images from the BioID database. 

The output of the detection system is the window containing the supposed face. 

Detection performance is evaluated on the basis of: 

■ False Positives: percentage of hypothesized face windows that do not contain the 
actual face; 

■ Missed Faces: percentage of images where the system has been unable to find a 
face; 

■ C-Error: percentage error calculated as the Euclidean distance between the real 
and the supposed face center, normalized with respect to the sum of the axes of 
the ellipse containing the face. 
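
For concreteness, the C-Error of a single detection can be computed as in the short 
Python sketch below. The function name and argument layout are hypothetical, and the 
sketch assumes that the caller supplies the two axis lengths of the ellipse in the same 
units (pixels) as the center coordinates; the text does not say whether full axes or 
semi-axes are meant. 

```python
from math import hypot


def c_error(true_center, detected_center, ellipse_axes):
    """Percentage error in the estimated face center.

    true_center, detected_center : (x, y) coordinates in pixels
    ellipse_axes                 : lengths of the two axes of the ellipse
                                   containing the face (assumed convention)
    """
    (tx, ty), (dx, dy) = true_center, detected_center
    a, b = ellipse_axes
    # Euclidean distance between the centers, normalized by the sum of the axes.
    return 100.0 * hypot(dx - tx, dy - ty) / (a + b)
```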

Table 1. Comparison of the new approach with [12] on the YaleB (frontal pose) database [17]. 

Algorithm       False Positive   Missed faces   C-Error 
New approach    8.9%             10.3%          23% 
[12]            10%              15.7%          35% 



Tab. 1 and Tab. 2 report the results of both the method of [12] and the new approach 
for the YaleB and BioID databases respectively. The classifiers adopted in the 
new approach were trained using disjoint training and test sets. An internally 
collected face database was used as the training set; the same database was used to 
define the different thresholds required by the method. 

Table 2. Comparison of the new approach with [12] on the BioID database. 

Algorithm       False Positive   Missed faces   C-Error 
New approach    5.99%            8.6%           10% 
[12]            6.83%            13.1%          20% 



The experimental results show that the new approach drastically reduces 
the percentage of missed faces while also reducing the number of false positives 
and the error in the estimation of the face center. 

The experiments also show that the feature transformation approach presented in this 
work can markedly reduce the classification error: the eye detection subsystem 
achieves an error rate of 17.87%, against the 25% of the k-NN classifier in the 
original space. 



6. Conclusions 

In this paper an approach to upright frontal face detection has been presented. The 
approach is based on a previous work [12], which has been extended with three 
additional steps to improve the detection performance. The preliminary experiments 
give encouraging results, showing that the additional steps help in 
overcoming common illumination problems and yield a noticeable 
performance improvement. The percentage of missed faces has been reduced from 
15.7% to 10.3% on the Yale B database and from 13.1% to 8.6% on the BioID database 
(about 35% improvement on both databases). The error in the estimation of 
the face center has also been drastically reduced: from 35% to 23% on the Yale B 
database (35% improvement) and from 20% to 10% on the BioID database (50% 
improvement). Moreover, the experimental results show that the new pattern 
representation improves the classification performance of the eye detection 
subsystem: the error has been reduced from 25% to 17.87% (about 29% 
improvement). 

Many aspects of the method could be further optimized. As to future research, we 
intend to extend the method to deal with more complex images containing several 
frontal and non-frontal faces. 

Abstract. We present a face verification system with an automatically computed 
acceptance threshold. The user provides the ratio between the 
costs assumed for a false acceptance and a false rejection. This cost ratio 
can be intuitively estimated by the person responsible for the system and is a starting 
point for fulfilling user security requirements. From this user-friendly input, an 
algorithm based on screening techniques computes the acceptance threshold. 
The algorithm is applied to an original and competitive face 
verification system based on principal component analysis and two classifiers 
(a radial basis function neural network and a support vector machine). 
Experimental results on a face database of 100 people are shown. The method 
can also be applied to other biometric applications in which such a threshold 
must be calculated. 



1 Introduction 

Biometric technology has moved in a few years from research labs to commercial 
implementations. Media coverage has brought face recognition systems used in 
high-profile locations, such as airports, to the attention of the public. Unfortunately, 
the recognition of the human face is a very complex problem involving several 
processing steps that have not yet been completely resolved. Although the technology is 
evolving and obtaining better results, expectations are very high and, in most cases, 
difficult to meet. As a consequence, several systems tested in real conditions have 
been rejected. 

However, less attention has been paid to access control systems. In these systems, 
the environment is more controlled, allowing the technology to obtain 
better and more reliable results. Such systems could fulfil the performance criteria 
demanded by potential clients. 

The experiment presented in this paper focuses on testing the performance of an 
access control system based on face verification technology. In access control 
environments, it is possible to take advantage of a set of specific characteristics. 
Usually, the subject is in front of the camera, only one subject appears, the size of the 
face is more or less constant and the subject is usually cooperative. It is therefore 
possible to obtain an initial set of images and to define a personal identification 
number, entered by the subject or stored in a smart card. Our system exploits these 
advantages and proposes an access control system designed to work in such situations. 

In recent years, two main approaches to the face processing problem using only image 
information have appeared. The first approach is Principal Components Analysis 
(PCA) and related methods such as Fisherfaces [1] [2] [3] [4]. These methods 
consider only the global information of the face. Likewise, methods based on Local 
Feature Analysis (LFA) [5] [6], similar to PCA, consider different kernel functions 
which concentrate on local features, such as eyes, mouth and nose. In this case, the 
selection of facial features and kernels is an open issue. The second approach, based on 
Elastic Bunch Graph Matching (EBGM) [7] and similar methods, uses a wavelet 
transformation to obtain a local description of the face and a graph to obtain a global 
face description. Several results with different research algorithms have been 
published in the scientific literature. For example, following the success of the FERET 
tests [8] [9] [10], a recent and extensive test of ten commercial products has been 
performed (FRVT 2002) [11]. 

A continuing problem in the design of a facial verification system is the choice of 
the optimum acceptance threshold. The acceptance threshold is the value that 
determines whether a verification results in acceptance or rejection. For example, in 
the SVM classifier, the threshold is w = 0. However, our experience shows that choosing 
a different value can result in better system performance, that is, in a smaller 
number of false acceptances and false rejections. The acceptance 
threshold should therefore be chosen to minimize the error rate. 

Furthermore, it is important to note that in a facial verification system there are two 
different error types, false acceptance and false rejection, each with a possibly 
different associated risk. For instance, in high security environments it is highly 
recommended to minimize the false acceptance rate even if the false 
rejection rate increases (the subject may have to key in the code twice). Likewise, 
for access to a non-critical place, a higher false acceptance rate could be 
acceptable so that the false rejection rate can be lowered (impostors could be accepted, 
but the code only has to be typed once to gain access). To take this into 
account, we propose a classification system based on costs for false acceptance and 
false rejection. The exact values of both costs (acceptance and rejection) could be 
difficult to find, but the ratio between these costs is easier to fix. This ratio is the 
input of the proposed algorithm. 

In this paper we present an innovative algorithm to calculate this optimal 
acceptance threshold by using economic screening techniques based on different costs 
for the different error types. 



2 Experimental Set Up Description 

The set-up has been designed and built to test the performance of the algorithm. 
Figure 1 shows the image acquisition set-up, consisting of two diffuse light sources 
placed on both sides of a video camera. 

In order to minimize distortions caused by changes in the lens focal length and 
the camera-subject distance, it is advisable to fix both in any operating environment. 
These requirements are easily met at any deployment site. In our experiments a 
database of 100 individuals is considered. Subjects were required to change their pose 
between the acquisition of two consecutive images. 





Fig. 2. Examples of the Face Database 



The image size is 320 x 240 pixels, with the face covering a large part of the image (as 
shown in Figure 2). Our face location system cropped the face to a window of 
130 x 140 pixels. Eight images per subject were used for computing the PCA matrix and 
training all classifiers. For the test sets, four different images per subject were 
considered. 



3 Face Verification System 

Face verification can be split into four processes: face location, PCA computation, 
classifier design and automatic optimal threshold calculation. The first three 
require a training or parameter computation phase followed, once all parameters have 
been adjusted and the classifiers trained, by a normal operation phase. The fourth 
process is detailed in Section 4. 



3.1 Face Location 

In this step, the image is the input and the desired output is a window containing only 
the face at a standard size. The background is first eliminated to obtain a rough initial 
estimate of the face location in the image. Subsequently, convolution with a face template 
is applied to obtain a more reliable and precise position of the face. Each subject in 
the database has their own template. The template is part of the subject’s face, so 
the convolution is more reliable where the template coincides with the face in the image. 
Initial tests suggest that one template per subject achieves better performance than one 
template for the whole database. A window containing the face is extracted at the 
position where the convolution reaches its maximum over the image. The final dimension 
was reduced to 130 x 140 pixels. In this step all images were also converted from colour 
to grey scale. 
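
The template search can be illustrated with the following pure-Python sketch. It is not 
the authors' implementation: it uses zero-mean normalized cross-correlation as a 
stand-in for the convolution mentioned in the text (the normalization is an addition 
made here so that bright regions do not dominate the score), and the function name 
and list-of-lists image format are assumptions. 

```python
from math import sqrt


def locate_template(image, template):
    """Return ((row, col), score) of the best match of `template` in `image`.

    Both arguments are 2-D lists of grey levels. The score is the zero-mean
    normalized cross-correlation, which lies in [-1, 1].
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = sqrt(sum(d * d for d in t_dev))
    best_score, best_pos = float("-inf"), (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            w_vals = [image[r + i][c + j] for i in range(th) for j in range(tw)]
            w_mean = sum(w_vals) / len(w_vals)
            w_dev = [v - w_mean for v in w_vals]
            w_norm = sqrt(sum(d * d for d in w_dev))
            if w_norm == 0 or t_norm == 0:
                continue  # flat window or template: correlation undefined
            score = sum(a * b for a, b in zip(w_dev, t_dev)) / (w_norm * t_norm)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

The returned position is the top-left corner of the window that would then be cropped 
and rescaled to the standard size. 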



3.2 Principal Components Analysis Computation 

Principal Components Analysis is the de facto standard in face verification systems. 
In the training phase, the transformation matrix is computed using a number of 
eigenvectors that retains almost 100% of the initial 
variance. Only one PCA matrix is computed from the training set of face images. In our 
experiment, eight images per subject are used to compute the PCA 
matrix, and 150 eigenvectors are retained. 
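
The choice of the number of eigenvectors from a target fraction of retained variance 
can be sketched as follows. The function name and the default fraction of 0.99 are 
illustrative assumptions; the text only states that almost 100% of the variance is 
retained and that 150 eigenvectors were used. 

```python
def components_for_variance(eigenvalues, fraction=0.99):
    """Smallest number of leading eigenvectors whose eigenvalues account
    for at least `fraction` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= fraction:
            return count
    return len(ordered)
```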

3.3 Verification 

Two classifiers have been considered: a Radial Basis Function (RBF) artificial neural 
network and a Support Vector Machine (SVM). In both cases, training is 
performed with eight images per subject (the same ones used for PCA computation). 
Tests were carried out using four images per subject. Training and test sets did not 
overlap. A large output value for SVM and RBF means that confidence is 
high. Thus a verification is considered positive when the output value is greater than 
the acceptance threshold. This acceptance threshold has to be set to the 
value that minimizes the false acceptance rate and the false rejection rate, and 
maximizes the correct rate. The magnitude used as threshold is different for each 
classifier: the output neuron value for RBF and the decision function value for SVM. 

RBF has been used as an artificial neural network classifier for face verification. 
The initial information is a subject image and a personal identification number (PIN) 
code. The PIN code indicates which output neuron is considered. In our experiment, 
the Gaussian functions considered are symmetric and centred in the middle of each 
subject's face cluster. 

The Support Vector Machine offers excellent results in 2-class problems. This classifier 
can easily be used in verification problems (recognizing one subject against the rest). 
In our experiment a linear kernel has been considered. 



4 Optimal Acceptance Threshold Calculation 

In order to optimize the acceptance threshold, we adopt a Bayesian screening 
approach [12] based on two variables, namely 

• A binary performance variable T, identifying whether an image has been 
taken (T = 1) or not (T = 0) of a given person. 

• A screening variable X defining the output of a known classifier, for instance, 
SVM or RBF. 

Since the screening variable X is not perfectly correlated with the performance 
variable, decisions made by using the screen are prone to error (false acceptance and 
false rejection). 

4.1 Economic Design of the Screen 

Suppose that our screening variable X is continuous and of the type the larger the 
better. That is, a large value of X tends to indicate a matching image or genuine 
(T = 1), whereas a small value of X is a sign of an impostor (T = 0). 

Under such an assumption, a single-stage screen based on the screening variable 
would naturally contain a cut-off point w, so that if X is above w, we accept the 
person as genuine, and if X is below w, we do not. Observe that if X = w there is 
an arbitrary choice between accepting and rejecting the person. From now on, and in 
order to be consistent, we shall accept items for which X = w, so that the screen is 
precisely defined as 

• if $X \ge w$, the person is accepted. 

• if $X < w$, the person is rejected. 

4.2 Optimal Acceptance Threshold 

We adopt an economic objective in which the value of the threshold w is determined 
in order to minimize the expected total cost of the procedure. Let $c_a$ and $c_r$ be the 
costs paid for a false acceptance and a false rejection by the system, respectively. The 
expected total cost of an image being classified based on the output of a classifier 
system such as SVM or RBF may be expressed as a function of w, so that 

$ETC(w) = c_r P(\text{wrongly reject image}) + c_a P(\text{wrongly accept image}).$ 

In formal notation, 

$ETC(w) = c_r P(T = 1, X < w) + c_a P(T = 0, X \ge w),$ 

which, assuming X is continuous, becomes 

$ETC(w) = c_r \int_{-\infty}^{w} P(T = 1 \mid X = x) f(x)\,dx + c_a \int_{w}^{\infty} [1 - P(T = 1 \mid X = x)] f(x)\,dx,$ 

where f(x) is the marginal density function of the screening variable X. 

To minimize this expected total cost for continuous X, we differentiate this 
expression with respect to w, and equate to zero, 

$ETC'(w) = c_r P(T = 1 \mid X = w) f(w) - c_a [1 - P(T = 1 \mid X = w)] f(w) = 0.$ 

Defining 

$k = \frac{c_a}{c_a + c_r},$ 

it is then straightforward to show that the equation 

$P(T = 1 \mid X = w) = k, \quad (1)$ 

gives the optimal value w for the acceptance threshold. Note also that, by defining 
the rate k, there is no need to state the values of the costs $c_a$ and $c_r$. The user may 
just give the rate k, which should be easier than fixing the costs. 

In order to identify the optimal limit for the first stage of the screen we need to 
solve equation (1) and, hence, to evaluate expressions of the form $P(T = 1 \mid X = x)$. 
It is necessary, therefore, to take into account the structure defining the relationship 
between X and T. 



4.3 The Model 

The structure for (X, T) is usually expressed as a parametric model with unknown 
parameters $\theta$. We denote the joint probability model for (X, T) given $\theta$ by $f(x, t \mid \theta)$ 
and try to obtain the unconditional model f(x, t) by using the available information 
about the parameters. There are two main approaches for this purpose: the estimative 
or classical approach and the predictive or Bayesian approach. Here we shall adopt a 
Bayesian approach, as it provides a natural but also rigorous theory for combining 
prior and experimental information as well as for making inference. 

We now propose the factorisation of the joint distribution of (X, T) through the 
conditional model for the continuous screening variable given the value of the 
performance variable. We also specify the distribution of X for genuine and impostor 
separately, so that 

$f(x, t \mid \theta) = f(x \mid T = 1, \mu_1, \sigma_1^2) P(T = 1 \mid p) + f(x \mid T = 0, \mu_0, \sigma_0^2) P(T = 0 \mid p),$ 

where $\theta = (\mu_1, \sigma_1^2, \mu_0, \sigma_0^2, p)$ and with $(\mu_1, \sigma_1^2)$, $(\mu_0, \sigma_0^2)$ and p independent. 

Remember that T is a binary performance variable, taking the value T = 1 if a 
photograph matches the subject identity and T = 0 otherwise. Its marginal distribution 
may, therefore, be defined by 

$P(T = 1) = p,$ 

$P(T = 0) = 1 - p,$ 

where p is the probability of success and hence satisfies 0 < p < 1. 

Let us then assume that the variable X follows a normal distribution with parameters 
$(\mu_1, \sigma_1^2)$ and $(\mu_0, \sigma_0^2)$ in each group, that is, 

$X \mid T = i \sim N(\mu_i, \sigma_i^2),$ 

for i = 0, 1, respectively. 

Here we are interested in the conditional probability of an item with screening 
value X = x being successful. By Bayes' theorem, this is 

$P(T = 1 \mid X = x, data) = \frac{f(x \mid T = 1, data)\, P(T = 1 \mid data)}{\sum_{i = 0, 1} f(x \mid T = i, data)\, P(T = i \mid data)} \quad (2)$ 



The conditional posterior predictive densities $f(x \mid T = i, data)$ for i = 0, 1 and the 
posterior predictive probability of a success $P(T = 1 \mid data)$ are both developed by 
using the Bayesian approach, both assuming non-informative prior distributions for 
the unknown parameters; see, for instance, [12]. 

The predictive posteriors of $X \mid T = i$ are found to be Student-t distributions with 
density functions 

$f(x \mid T = i, data) \propto \left[ 1 + \frac{(x - \bar{x}_i)^2}{(n_i - 2)\, \rho_i} \right]^{-(n_i - 1)/2},$ 

where $\rho_i = (1 + n_i^{-1}) s_i^2$ and where $\bar{x}_i$, $s_i$ and $n_i$ are the sample mean, sample standard 
deviation and sample size for each of the two groups, i = 0, 1, that is, for 
genuine and impostors. 

In developing the posterior probability of an image matching the subject identity, 
$P(T = 1 \mid data)$, it is of interest to recognize that the number of successes $n_1$ and the 
number of failures $n_0$ have been chosen in advance, and that no additional information 
about the probability of success is therefore provided by the data. Thus we set a 
non-informative prior for the parameter p, which results in equivalent posterior 
predictive probabilities for genuine and impostors, that is, 

$P(T = 1 \mid data) = P(T = 0 \mid data) = 1/2.$ 

Once all the elements in expression (2) have been developed, optimal values of the 
acceptance threshold w are easily calculated by employing numerical techniques. 
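
One possible numerical procedure is sketched below in Python. It is an illustration 
under stated assumptions rather than the procedure actually used: each predictive 
density is taken to be a Student-t with $n_i - 2$ degrees of freedom, location $\bar{x}_i$ and 
squared scale $\rho_i$; the two classes are given equal posterior probability; and 
equation (1) is solved by bisection between the two sample means, where the posterior 
is assumed to increase with x. All function names are hypothetical. 

```python
import math


def log_t_density(x, mean, s, n):
    """Log predictive density: Student-t with n - 2 degrees of freedom
    (assumed), location `mean` and squared scale rho = (1 + 1/n) * s**2."""
    nu = n - 2
    rho = (1.0 + 1.0 / n) * s * s
    z2 = (x - mean) ** 2 / (nu * rho)
    return (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi * rho)
            - 0.5 * (nu + 1) * math.log1p(z2))


def posterior_genuine(x, genuine, impostor, prior_genuine=0.5):
    """P(T = 1 | X = x, data) from expression (2).

    `genuine` and `impostor` are (sample mean, sample std, sample size).
    """
    log_odds = (log_t_density(x, *genuine) - log_t_density(x, *impostor)
                + math.log(prior_genuine / (1.0 - prior_genuine)))
    if log_odds > 700.0:    # avoid overflow in exp()
        return 1.0
    if log_odds < -700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(-log_odds))


def optimal_threshold(k, genuine, impostor, tol=1e-9):
    """Solve P(T = 1 | X = w, data) = k for w by bisection."""
    lo, hi = impostor[0], genuine[0]
    p_lo = posterior_genuine(lo, genuine, impostor)
    p_hi = posterior_genuine(hi, genuine, impostor)
    if not p_lo < k < p_hi:
        raise ValueError("k is not bracketed by the two sample means")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if posterior_genuine(mid, genuine, impostor) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Usage with hypothetical statistics (not those of the experiments):
w = optimal_threshold(0.9, genuine=(2.0, 1.0, 50), impostor=(-2.0, 1.0, 50))
```

Raising k moves the threshold towards the genuine scores, which is the expected 
behaviour when a false acceptance is made more expensive. 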



5 Results and Discussion 

The results are presented in two stages. First, we present the optimal 
acceptance threshold calculation for different ratios of acceptance and rejection costs, 
that is, for different values of the constant k. Second, we show the variation of 
FRR and FAR in each cost case. 

Exploratory analysis of the data shows that the screening variable X is continuous 
and of the type the larger the better, as required by our screening set-up, with 
sufficient statistics given in Table 1. 

Table 1. Sufficient statistics for genuine and impostor for the SVM and RBF classifier. 

                  SVM                         RBF 
         x̄_i     s_i     n_i         x̄_i      s_i     n_i 
T = 1    4.009   1.735   400         0.828    0.340   400 
T = 0    0.306   0.277   39600       -1.696   0.455   39600 



In order to see how the acceptance threshold w changes with the different values 
of $c_a$ and $c_r$, we compute optimal values of w corresponding to different values of 
the constant k, 0 < k < 1, for the two classifiers, SVM and RBF. The results 
are shown in the following graphs. 




Fig. 3. SVM Optimal threshold 

Fig. 4. RBF Optimal threshold 



Recall that $k = c_a / (c_a + c_r)$. We now consider three specific values of the constant 
k, which may be identified with three different security levels of access control, or 
situations, in which our face verification system may be applied. 

A low security level system: In our set-up, this situation may be identified by using 
an acceptance cost much smaller than the rejection cost. By assuming $c_a = 0.1 c_r$, for 
instance, we obtain k = 0.091. In this situation the system will not be very 
restrictive and the FRR is forced to be very low. This security level could be applied 
in a supervised parking access control, where it is important to avoid traffic jams. 

A medium security level system is represented by equal rejection and 
acceptance costs; this is the case where we assume that a false acceptance is as 
dangerous (or expensive) as a false rejection. Note that then k = 0.5. 

A high security level system: This could be represented by using an acceptance 
cost much higher than the rejection cost. For instance, if we assume that 
$c_a = 10 c_r$, the value of k turns out to be k = 0.909. In this case the FAR is nearly zero 
for the RBF classifier and null for the SVM classifier (even though the FRR could 
be high). This system is highly restrictive and could be applied to access control 
where we are interested in preventing impostors from entering. 
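
The three values of k follow directly from the definition of the rate, as the following 
small Python check illustrates (the function name is hypothetical; only the ratio 
between the two costs matters, so $c_r$ is set to 1). 

```python
def cost_rate(c_a, c_r):
    """Rate k = c_a / (c_a + c_r) between the false acceptance cost c_a
    and the false rejection cost c_r."""
    return c_a / (c_a + c_r)


low = cost_rate(0.1, 1.0)      # c_a = 0.1 c_r, low security level
medium = cost_rate(1.0, 1.0)   # c_a = c_r, medium security level
high = cost_rate(10.0, 1.0)    # c_a = 10 c_r, high security level
```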

Table 2 shows the optimum acceptance threshold in three different cases: low, 
medium and high security level. Note that FAR decreases as the security level 
(acceptance cost) increases. 

Table 2. Optimum Acceptance Threshold variation with FAR and FRR in each case. 

                           SVM                         RBF 
c_a-c_r rate    k        w        FRR(%)  FAR(%)     w       FRR(%)  FAR(%) 
c_a = 0.1 c_r   0.091    -0.717   1       0.17       1.081   2.00    2.45 
c_a = c_r       0.500    -0.366   1       0.01       1.527   7.21    0.32 
c_a = 10 c_r    0.909    0.001    3.50    0          1.888   11.72   0.31 

Figure 5 shows the FRR and FAR for a wide variation of optimal acceptance 
thresholds. These results are presented as a conventional DET curve [13], which plots 
on a log-deviate scale the False Rejection Rate (FRR) as a function of the False 
Acceptance Rate (FAR). We present a DET curve for each classifier, SVM and RBF. 
The point of the DET curve at which FRR = FAR is called the Equal Error Rate 
(EER). While the EER may not be useful in real-world applications, it can be helpful in 
comparing the performance of systems or algorithms. 
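
To make the quantities concrete, the following Python sketch computes FAR and FRR at a 
given threshold and approximates the EER by scanning the observed scores. It is a 
minimal illustration with hypothetical function names; the EER is approximated at the 
threshold where FAR and FRR are closest, since the two rates need not coincide exactly 
on a finite sample. 

```python
def far_frr(genuine_scores, impostor_scores, w):
    """FAR and FRR (in %) for the screen that accepts when score >= w."""
    rejected_genuine = sum(1 for s in genuine_scores if s < w)
    accepted_impostors = sum(1 for s in impostor_scores if s >= w)
    frr = 100.0 * rejected_genuine / len(genuine_scores)
    far = 100.0 * accepted_impostors / len(impostor_scores)
    return far, frr


def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER (in %) and the threshold at which it is attained."""
    best = None
    for w in sorted(set(genuine_scores) | set(impostor_scores)):
        far, frr = far_frr(genuine_scores, impostor_scores, w)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, w)
    return best[1], best[2]
```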



Fig. 5. DET curve 

In this figure we can see that the SVM classifier is more reliable than RBF. If we 
consider the EER as a measure of system performance, the superiority of SVM is 
clear: EER(SVM) = 0.99 and EER(RBF) = 2.43. 



6 Conclusion 

In this paper we have presented a reliable face verification system with an innovative 
module: automatic evaluation of the optimal acceptance threshold using Bayesian 
screening techniques. This ensures that the security level is under control while 
keeping error levels to a minimum. 

Using the proposed algorithm, the user is allowed to provide the cost assumed 
for a false acceptance or a false rejection. This allows the system to be tailored 
to user security requirements. Furthermore, the user may indicate the 
required level of security in an intuitive way, and parameter computation is hidden from 
the user. The proposed system can work under several security conditions that can be 
changed by the user. 

The proposed method is valid for all face verification systems, independently of
the classifier. It has been integrated into an existing system, and the results
show that integrating the algorithm is not expensive.

It is of interest to note that a face verification system may be adapted to the 
environment and the specific conditions of the future application in order to obtain 
satisfactory results. 
Abstract. We apply the methods of the theory of elasticity to two important
problems in fingerprint-based authentication: (1) registration of
deformations up to the level of pixel-wise correspondence of two images;
(2) parametric modeling and exact measurement of the natural deformations.
The approach is based on the numerical solution of the Navier linear
PDE; registration is achieved even in cases of significant
losses and errors in the initial correspondences of minutiae, which may be
caused by various noise and distortion factors. A relatively compact and
theoretically grounded model of the deformations is proposed, which
allows discrepancies to be estimated in the most extreme cases.



1 Introduction 

The basic distortion factors that negatively affect the performance of fingerprint
verification are as follows [1,2]: small area of intersection, poor quality of the input
images, and elastic deformations (ED). The first factor is rather subjective and
can be avoided (by the positioning of the cuticle, for example) either by a user
who cooperates with the verification system, or by an operator in the case of AFIS
or civil AFIS systems [2]. Therefore, only two aspects of interest are left: noise
and ED.

Once there is a distorting factor, two approaches are possible: making
the algorithm invariant to the factor or suppressing its influence. Since the
spatial spectrum features of fingerprint minutiae are very close to the characteristics
of typical noise (smudging and loss of ridge segments), most
existing technologies [3-7] prefer to suppress distortions and recover
the papillary lines (ridges and valleys). This choice results in limitations
on SNR; e.g., as has been constructively shown in [8], even 50% white noise
creates areas that can subjectively be taken for minutiae. Considering the
factor of ED separately, it is still not evident that an ED-invariant algorithm,
say, one that uses inter-ridge counting, performs worse than
one that uses metrical matching. However, one reason in favor of
ED suppression can be adduced: the associative matrices introduced by T.
Kohonen [9] and modern neural networks provide reliable identification
at very high noise levels (up to 99%), i.e., reducing fingerprint matching
to rigid movement (procrustean matching) is reasonable at least when both
factors are present.
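To make the notion of matching up to a rigid movement concrete, the following Python sketch fits a rotation and a translation to paired 2D points in the least-squares sense. It is a generic closed-form planar Procrustes fit (without scaling), not the matching procedure of any system cited here. The function names and the sample coordinates are hypothetical, and the point correspondences are assumed to be known.

```python
import math

def rigid_fit(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst (2D)."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy      # centred source point
        bx, by = qx - dx, qy - dy      # centred target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    # The optimal angle maximises cos(theta)*dot + sin(theta)*cross.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)

def apply_rigid(points, theta, t):
    """Rotate the points by theta about the origin, then translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]

# Hypothetical minutiae coordinates and a known rigid movement.
src = [(0.0, 0.0), (10.0, 0.0), (0.0, 5.0), (7.0, 3.0)]
dst = apply_rigid(src, math.radians(30.0), (4.0, -2.0))

theta, t = rigid_fit(src, dst)
print(math.degrees(theta), t)
```

Residuals that remain after such a fit on real fingerprint pairs would include the elastic component of the deformation, which a rigid movement cannot absorb.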

Although a developed theory of elastic deformations exists, it is rarely
applied in real-time systems because of its computational complexity.

There are a number of approaches to the registration of elastic deformations. One
of the first was introduced by D.J. Burr [10] and used the concept of
rubber masks. The approach suggested by A.M. Bazen and S.H. Gerez [11] is based
on thin-plate spline (TPS) models, first applied to biological objects by
F.L. Bookstein [12]. This method requires determining corresponding points in
the two compared images (matching points), and it suffers from a lack of precision
when there are few matching points. Modifications of TPS (approximate thin-plate
splines and radial basis function splines) were introduced by M. Fornefett, K.
Rohr and H. Stiehl [13], [14], who consider deformations of biological tissues. However,
this approach also requires many matching points (more than 100), which is virtually
impossible in fingerprint applications, because the number of minutiae in a fingerprint
image rarely exceeds 50. This makes TPS and its variants hardly applicable
to the registration of fingerprint deformations.

A completely different approach was suggested by R. Cappelli, D. Maio and
D. Maltoni [15], who developed an analytical model of fingerprint deformation.
It has some shortcomings, for example, irreversibility of even small deformations.
However, it was one of the few works, if not the only one, in which a parametric ED
model was introduced. In [16] we proposed an algorithm for the restoration
of deformations from known corresponding points in two images, based on the
numerical solution of the Navier PDE by the finite element method (FEM), together with
examples of its implementation and a statistical analysis of the distribution of
deformation energy for the available fingerprint databases [17,18]. Here, we
consider one more approach to the registration problem, based on convolution
with pulse responses. We also propose a more compact scheme for the parametric
description of ED, which allows discrepancies to be estimated in
extreme cases.

2 Model of Elastic Deformation 

In general the dynamics of a small elastic deformation is considered to satisfy 
the Navier linear elastic PDE: 

$$ L u(x, y, z, t) = -f(x, y, z), \qquad (1) $$

where L is the following differential operator: 

$$ L = \mu \nabla^2 + (\lambda + \mu)\, \nabla \operatorname{div} - \rho \frac{\partial^2}{\partial t^2}, \qquad (2) $$

u is the vector of displacement; f is the external force. Coefficients $\lambda$ and $\mu$ are
the Lamé elasticity constants. These parameters can be interpreted in terms
of Young's modulus E and Poisson's ratio $\nu$




